Automatically Annotating Text with Linked Open Data


1 Automatically Annotating Text with Linked Open Data
Delia Rusu, Blaž Fortuna, Dunja Mladenić
Jožef Stefan Institute

2 Motivation: Annotating Text with LOD
OpenCyc, DBpedia, WordNet

3 Overview
Related work
Algorithms for annotating with LOD
  PageRank
  ContextSimilarity
Evaluation
  Datasets: WordNet, OpenCyc, DBpedia
Conclusions and Future Work

4 Related Work
Supervised approaches:
  Parallel corpora (Chan et al., SemEval 2007)
Knowledge-based:
  WordNet::Similarity package (Pedersen et al.)
  Usage of context-free grammars to validate semantic interconnections (Navigli and Velardi, 2005)
  Formal document structure description, hypothesis building, trying to reason using Cyc (Curtis et al.)
  Disambiguating Wikipedia articles into Cyc concepts (Medelyan and Legg, 2008)
  Adapted versions of PageRank (Mihalcea et al., 2004; Agirre and Soroa, 2009)
  Simple knowledge-based approaches compete with state-of-the-art supervised approaches when using a high-quality knowledge base (Ponzetto and Navigli, 2010)

5 LOD Dataset Representation WordNet (VUA)

6 LOD Dataset Representation
RDF terms: rdf:resource, rdf:type, rdfs:subClassOf
Example: rdf:resource=" wordsense-values-noun-1" rdf:type rdf:resource=" synset-belief-noun-1"
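A minimal sketch of how triples of this kind might be turned into a graph for the algorithms on the following slides. The triples, the shortened identifiers, and the `build_graph` helper are hypothetical placeholders for illustration; they are not the actual WordNet URIs or the authors' code, and treating the relations as undirected edges is an assumption.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples. The identifiers are
# shortened placeholders, not real WordNet URIs.
TRIPLES = [
    ("wordsense-values-noun-1", "rdf:type", "synset-belief-noun-1"),
    ("synset-belief-noun-1", "rdfs:subClassOf", "synset-content-noun-5"),
    ("synset-ideal-noun-1", "rdfs:subClassOf", "synset-idea-noun-1"),
]


def build_graph(triples, ignored_predicates=frozenset()):
    """Return an undirected adjacency map: resource -> set of neighbours."""
    graph = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate in ignored_predicates:
            continue  # skip relations that should not become edges
        graph[subject].add(obj)
        graph[obj].add(subject)
    return dict(graph)


graph = build_graph(TRIPLES)
print(sorted(graph["synset-belief-noun-1"]))
```

The `ignored_predicates` parameter is one way to leave out relations that carry no semantic information before the graph is built.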

7 Algorithms: PageRank
Candidate resources for a word belonging to the text fragment
Example for the word "values" (human-readable descriptions of the candidate resources):
  beliefs of a person or social group in which they have an emotional investment (either for or against something); "he has very conservative values"
  (an ideal accepted by some individual or group) "he has old-fashioned values"
  ((music) the relative duration of a musical note)

8 Algorithms: PageRank
Algorithm steps:
  Set each graph vertex to 0 if the vertex does not represent a candidate resource, or to 1/R otherwise, where R is the total number of candidate resources.
  Compute the PageRank value PR[Vi] for each vertex i (the formula shown on the slide is not preserved in this transcript).
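The two steps can be sketched as follows. The update rule used here is the standard PageRank formula, PR(Vi) = (1 - d) + d * sum over neighbours Vj of PR(Vj) / degree(Vj), which is an assumption and may differ from the formula on the slide. The damping factor of 0.85, the fixed iteration count, the undirected toy graph, and all vertex names are also hypothetical.

```python
def pagerank(graph, candidates, damping=0.85, iterations=30):
    """Score vertices of an undirected graph (dict: vertex -> set of neighbours).

    Vertices start at 1/R if they are candidate resources and at 0 otherwise,
    where R is the number of candidate resources.
    """
    r = len(candidates)
    scores = {v: (1.0 / r if v in candidates else 0.0) for v in graph}
    for _ in range(iterations):
        updated = {}
        for v in graph:
            # Each neighbour shares its score equally among its own edges.
            incoming = sum(scores[u] / len(graph[u]) for u in graph[v])
            updated[v] = (1.0 - damping) + damping * incoming
        scores = updated
    return scores


def best_candidate(graph, candidates, **kwargs):
    """Return the candidate resource with the highest PageRank score."""
    scores = pagerank(graph, candidates, **kwargs)
    return max(candidates, key=lambda c: scores[c])


# Hypothetical toy graph: two candidate resources for the word "values".
toy_graph = {
    "values#belief": {"belief", "culture", "tradition"},
    "values#music": {"duration"},
    "belief": {"values#belief", "culture"},
    "culture": {"values#belief", "belief"},
    "tradition": {"values#belief"},
    "duration": {"values#music"},
}
candidates = {"values#belief", "values#music"}
print(best_candidate(toy_graph, candidates))
```

In this toy graph the better-connected candidate receives the higher score, which is the intuition behind relying on the dataset's relationship structure.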

9 Algorithms: ContextSimilarity
Text fragment: In a country as diverse and complex as India, it is not surprising to find that people here reflect the rich glories of the past, the culture, traditions and values relative to geographic locations and the numerous distinctive manners, habits and food that will always remain truly Indian.
Candidate resource descriptions:
  beliefs of a person or social group in which they have an emotional investment (either for or against something); "he has very conservative values"
  (an ideal accepted by some individual or group) "he has old-fashioned values"
  ((music) the relative duration of a musical note)
Candidate neighborhood resource descriptions:
  belief: (any cognitive content held as true)
  ideal: (the idea of something that is perfect; something that one hopes to attain)
  duration, continuance: (the period of time during which something continues)

10 Algorithms: ContextSimilarity
ContextSimilarity(resource, w_a) returns Similarity
    Similarity = 0
    NR = GetNeighborhoodResources(resource)
    CW = GetContext(w_a)
    for i = 1 to Size(NR) do
        CS = sim_cos(NR[i], CW)
        Similarity = Similarity + CS
    end for
    return Similarity
Example word: values
Candidate resource descriptions:
  beliefs of a person or social group in which they have an emotional investment (either for or against something); "he has very conservative values"
  ((music) the relative duration of a musical note)
  (an ideal accepted by some individual or group) "he has old-fashioned values"
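A minimal Python sketch of the ContextSimilarity pseudocode, assuming that both the neighborhood resource descriptions and the word context are compared as plain bag-of-words vectors. The tokenization, the absence of stop-word removal or term weighting, and the helper names `bag_of_words` and `cosine` are assumptions made for illustration; the slides do not specify them.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def context_similarity(neighborhood_descriptions, context):
    """Sum the cosine similarity of each description with the word context."""
    context_vector = bag_of_words(context)
    return sum(
        cosine(bag_of_words(description), context_vector)
        for description in neighborhood_descriptions
    )


context = "the culture, traditions and values relative to geographic locations"
belief_sense = [
    "beliefs of a person or social group in which they have an emotional investment",
    "any cognitive content held as true",
]
print(context_similarity(belief_sense, context))
```

Here the caller supplies the descriptions returned by GetNeighborhoodResources; whether that list also contains the candidate resource's own description is left to the caller.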

11 Evaluation Datasets
Expert annotators
  WordNet: SemEval 2007 word sense disambiguation Task 7, Coarse-Grained English All Words; of the annotated words, 1591 are polysemous (WordNet)
  OpenCyc: a subset of 50 words from the SemEval 2007 Task 7 corpus, each with more than one candidate resource, manually annotated by 2 annotators
  DBpedia: a subset of 56 words from the SemEval 2007 Task 7 corpus, each with more than one candidate resource, manually annotated by 1 annotator
Crowdsourcing annotators
  WordNet and OpenCyc: a subset of 325 words for WordNet and 177 for OpenCyc, from the SemEval 2007 Task 7 corpus

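12 Evaluation Results
Expert annotators: results table with rows CS, PR, and Random and columns WordNet, OpenCyc, and DBpedia (values not preserved)
Crowdsourcing annotators: results table with row CS and columns WordNet and OpenCyc (values not preserved)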

13 Conclusions
We investigated the applicability of two common approaches, taken from the word sense disambiguation community, for annotating text with LOD datasets:
  relying on the dataset relationship structure (PageRank)
  taking advantage of the human-readable description of a resource as well as the neighborhood relationships defined for that resource (ContextSimilarity)
Three datasets: WordNet, DBpedia and OpenCyc.
Experiments revealed the shortcomings of the current state-of-the-art word sense disambiguation methods when applied to different LOD datasets.

14 Conclusions
Purpose for which the dataset was developed
  WordNet: dictionary-based taxonomy
  OpenCyc: common-sense knowledge base primarily developed for modeling and reasoning
  DBpedia: an effort to extract structured information from Wikipedia
WordNet
  highest ratio of covered words
  candidate resources correspond directly with the possible word meanings
OpenCyc
  distinctions between resources depend on the reasoning task
  contains concepts which are created to support specific tasks (reasoning, paraphrasing, etc.)
DBpedia
  rich set of instances (named entities: places, people, and organizations)
  few common words covered
  named entities whose names are common words (e.g. "Talk" is a song by the British alternative rock band Coldplay)

15 Conclusions
Human-readable descriptions
  WordNet: written similar to dictionary entries; also contain examples; easier to understand by the general public
  OpenCyc: documentation for the ontology engineer using it to model some world phenomena; hard to understand by the general public
  DBpedia: written like encyclopedia entries; very short for some resources
Relations between resources
  WordNet: most relation types are defined between the same parts of speech; useful relations between different parts of speech are missing
  DBpedia: contains infrastructure relationships (e.g. wikiPageUsesTemplate is the most common relation in DBpedia infobox triples), which should be disregarded as they introduce noise

16 Future Work
Further develop text annotation methods which can offer better performance on datasets such as OpenCyc and DBpedia, and can be transferred to other datasets
Investigate the potential for combining resources from different datasets in the same task
Include elements of active learning, having users in the loop provide a few annotations, to:
  enhance the discovery of hard-to-disambiguate text fragments
  acquire labeled data for performing algorithm optimization

17 Thank You for Your Attention!
