Semantic Enrichment of Places for the Portuguese Language


Jorge O. Santos 1, Ana O. Alves 1,2, Francisco C. Pereira 1, Pedro H. Abreu 1
1 CISUC, Centre for Informatics and Systems of the University of Coimbra, Portugal
jasant@student.dei.uc.pt, {ana,camara,pha}@dei.uc.pt
2 ISEC, Coimbra Institute of Engineering, Portugal
aalves@isec.pt

Abstract. A large amount of descriptive information about places is widely spread on the web, provided in a crowd-sourced way by services such as commercial directories or social networks, which allows us to classify a particular location based on the atomic entities comprised in it: points of interest. However, different services provide different types of information and currently there is no easy way to unify this kind of data to obtain a richer description than the classic place representation: a pair of coordinates and a name. In this article, we present a platform which is able to semantically enrich places. This is achieved by integrating multiple data sources and extracting meaningful terms to automatically label places, applying Natural Language Processing and Information Extraction techniques to contents fetched from the web, with such terms validated against knowledge bases like Wikipedia or Wiktionary. The approach focuses on the Portuguese language, aiming to be part of a semantic data gateway for TICE.Mobilidade, a mobility platform.

Keywords: Semantics of Place, Information Extraction, Automatic Tagging, Natural Language Processing

1 Introduction

In this paper we present our approach to developing a dynamic place enrichment platform for the Portuguese language, able to continuously gather place information from multiple sources such as location-enabled social networks or commercial directories. The approach is inspired by KUSCO [1], a system for English that, given a set of webpages containing place information, extracts a ranked list of concepts.
This work aims to evolve KUSCO into a dynamic and robust platform, allowing the integration of different information sources and applying its methodology to the Portuguese language. By assigning semantic annotations to places, we are able to characterize locations based on the points of interest (POIs) that comprise them, a useful resource for external services such as a tour planner that chooses places to visit based on the user's preferences. This approach is currently being implemented as a module of TICE.Mobilidade, a project whose vision stands by the creation of a digital

platform providing mobility services in Portuguese territory. The module will be responsible for transforming raw information originally supplied by multiple data providers into valuable knowledge for other modules, which offer a wide range of services to the end user. Besides the semantic enrichment service (detailed in this paper), the referred module also comprises two other services: one responsible for storing knowledge (Semantic Interoperability service) and another providing a recommendation system based on user habits (Artificial Selective Attention service).

The next section aims to provide the reader with some essential background, describing related work in the area. In the third section the proposed approach is specified, describing the key points of the semantic enrichment process as well as the architecture, with each module described according to its contribution to the approach. Finally, we present some experiments and discuss the validation of the results obtained in them.

2 Background

The rising popularity of location-based social networks and other community-based data sources makes them a relevant source of place-related data, containing information like recommendations about a particular dish at a restaurant, a show that is usually performed at a concert venue, or even transportation-related information about how to get to a particular place. Many authors have started to take advantage of such resources, creating useful services by mining this type of information. KUSCO [1] (Knowledge Unsupervised Search for populating Concepts on Ontologies), the main inspiration for the approach proposed in this paper, allows the annotation of places and events by extracting meaningful concepts from online texts, enriching these with semantic information.
The system receives a POI source as input (a collection of documents, a directory or an API), which is mapped onto the conceptual data model, automatically populating a POI database. Based on the data collected, a document search about each POI is performed, exploring the World Wide Web and Wikipedia in four different ways. The web is explored in two distinct modes: open and focused. Wikipedia is used in two different ways, to locate generic and specific information about places. These search methods are named Perspectives of Semantic Enrichment. From the documents retrieved from these data sources, a bag of relevant concepts (semantic indexes) is extracted, each concept being weighted using statistical relevance metrics.

Dearman and Truong [2] used reviews from Yelp's authoring community to identify activities supported by a certain location. By harvesting review pages and mining their content with information extraction and natural language processing techniques like tokenization and part-of-speech (POS) tagging, the authors were able to extract verb-noun pairs from the reviews representing possible activities which can be performed at a particular location, like "eat pizza" or "order latte", validating the use of community-authored content to identify activities.
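The verb-noun pair extraction described by Dearman and Truong can be sketched over already POS-tagged text. The tagset, window size and sample review below are illustrative assumptions, not details taken from their paper:

```python
# Sketch: extract candidate activity (verb, noun) pairs from POS-tagged text.
# The tagged input is hard-coded here; in practice it would come from a POS tagger.

def extract_activity_pairs(tagged_tokens, window=2):
    """Return (verb, noun) pairs where a noun follows a verb within `window` tokens."""
    pairs = []
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag.startswith("VB"):  # a verb tag
            for j in range(i + 1, min(i + 1 + window, len(tagged_tokens))):
                w2, t2 = tagged_tokens[j]
                if t2.startswith("NN"):  # first noun inside the window
                    pairs.append((word.lower(), w2.lower()))
                    break
    return pairs

review = [("We", "PRP"), ("eat", "VBP"), ("pizza", "NN"), ("and", "CC"),
          ("order", "VBP"), ("a", "DT"), ("latte", "NN")]
print(extract_activity_pairs(review))  # [('eat', 'pizza'), ('order', 'latte')]
```

Real systems would add lemmatization and frequency filtering on top of this window heuristic.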

Alazzawi et al. [3] proposed a methodology to identify concepts of services offered at a particular place and activities that can be performed there. By detecting language patterns of place affordance in POS-tagged texts from collections of service types, the authors apply these patterns to other resources on the web in order to retrieve service and activity concepts. This way, they are able to contextualize places, giving particular relevance to the activities performed at such locations (e.g. "Church" has associated concepts like "Worship", "ChristianWorship" or "HoldChristianServices").

Topica [4] is a web application that uses the Facebook Places API to retrieve POIs in a given area, relating each one to its corresponding fan page in order to extract user comments from it. These comments are enriched using external resources such as OpenCalais [5] and Zemanta, services that are able to extract entities, keywords and related pages from a supplied text. This data is used to model each comment as a DBpedia resource and to query DBpedia in order to extract a set of categories from it. Each category gets a weight based on a TF-IDF (Term Frequency x Inverse Document Frequency) calculation over the contents present on the current fan page and those already retrieved. Finally, it presents the collected data on a map interface, where the user can access the retrieved data in order to get textual and multimedia contents.

Our approach shares some aspects with the works of Dearman and Truong and of Alazzawi et al., but, as with KUSCO, we are more focused on the meaning of each place, gathering information from a dynamic set of sources and generating terms that can be considered meaningful for a given place. The main innovations of our approach compared to KUSCO are the dynamic handling of data sources, the support for the Portuguese language and its availability through a robust platform.
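As a point of reference, the TF-IDF weighting mentioned here (and reused in our own relevance computation, Section 3.5) can be sketched as follows; the document names and contents are purely illustrative:

```python
import math
from collections import Counter

# Minimal TF-IDF sketch: term frequency in one document times the log of the
# inverse fraction of documents containing the term. Toy corpus below.
docs = {
    "fanpage1": "pizza pasta pizza wine",
    "fanpage2": "wine cheese bread",
    "fanpage3": "pizza oven wood",
}

def tf_idf(term, doc_id, docs):
    tokens = docs[doc_id].split()
    tf = Counter(tokens)[term] / len(tokens)
    n_containing = sum(1 for text in docs.values() if term in text.split())
    idf = math.log(len(docs) / n_containing)  # assumes the term occurs somewhere
    return tf * idf

print(round(tf_idf("pizza", "fanpage1", docs), 3))  # → 0.203
```

"pizza" has tf = 2/4 in fanpage1 and appears in 2 of 3 documents, so its weight is 0.5 × ln(1.5).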
3 Approach

This section is organized following the platform's architectural structure, describing how each module works and how it relates to the others. Each module is responsible for a specific set of tasks, described here according to the role they play in the platform. Figure 1 presents an overview of the whole semantic enrichment process.

3.1 Enrichment Source Extraction

The process is triggered when a list of POIs is received, each POI's information being sent to a set of extractor scripts, one per information source. These scripts receive a text query, a pair of coordinates and a distance radius as input (e.g. ("Torre de Belém", [ , ], 500)), in order to retrieve points of interest within that area. The data extracted here comprises information like descriptions, user reviews, usage statistics (measured by some social networks in terms of user interaction), contacts and even directions to

a place using public transportation. The existing extractors are spawned in a thread pool with the provided parameters and executed concurrently. This initial phase yields a set of POIs, which is handed to the integration module as a standard response, using the JSON (JavaScript Object Notation) format with all the extracted data properly structured. This input/output standard (defined as an interface pattern in the system) allows us to customize these submodules, making the platform able to interact with new extractors when the need for new data sources arises.

[Fig. 1. Semantic Enrichment Model: web data collection over multiple enrichment sources; integration and duplicate detection; part-of-speech tagging, noun-phrase chunking and named entity recognition; contextualization and term validation against PAPEL, Wikipedia and Wiktionary; relevance computing; semantic concepts.]

3.2 Resource Integration

The data fetched by the extractors is then submitted to the integration module, where each fetched POI is analyzed in order to verify whether we are dealing with the same entity. Since we are handling multiple information sources, there is a high probability of finding duplicate POIs in the process. We take advantage of such duplication to collect complementary data that did not exist in the original POI.
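Going back to the extraction step (Section 3.1), the concurrent extractor fan-out with a common JSON-shaped response can be sketched as below; the extractor bodies, source names and coordinates are illustrative stubs, not the platform's actual code:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Sketch: one extractor function per source, all taking (query, coords, radius)
# and all returning the same JSON-shaped structure. Bodies are stubs.

def facebook_extractor(query, coords, radius):
    return {"source": "facebook", "pois": [{"name": query, "coords": coords}]}

def foursquare_extractor(query, coords, radius):
    return {"source": "foursquare", "pois": []}

EXTRACTORS = [facebook_extractor, foursquare_extractor]

def enrich(query, coords, radius):
    # Spawn every extractor concurrently in a thread pool with the same parameters.
    with ThreadPoolExecutor(max_workers=len(EXTRACTORS)) as pool:
        futures = [pool.submit(ex, query, coords, radius) for ex in EXTRACTORS]
        return [f.result() for f in futures]

# Illustrative coordinates only (not taken from the paper):
responses = enrich("Torre de Belém", (38.6916, -9.2160), 500)
print(json.dumps(responses, ensure_ascii=False))
```

Adding a new source is then just a matter of registering another function that honours the same signature and response shape.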

In order to detect duplicates, the process goes through the attributes present in the POI, essentially checking how similar the names and websites (when available) are across POIs. To compute this similarity, we make use of an open-source library [6] which provides a wide range of string similarity algorithms [7]. We consider two POIs as duplicate candidates if they fit one of the following groups:

- the distance between both POIs is less than 100 meters, the name similarity is above 0.9 and no website information is available for one or both POIs;
- the distance between both POIs is less than 100 meters, the name similarity is above 0.7 and the best match among the available websites is above 0.5 (since one POI can contain multiple websites, this value stands for the highest string similarity between the two lists of websites).

These thresholds were obtained as the result of an iterative process, starting with lower values and analyzing the detected duplicates to verify whether the matches were acceptable. Since the first group is based on only two attributes, we set its name similarity threshold higher to assure a low rate of false positives.

If a duplicate is detected and it brings new information to the existing resource, both resources are merged. However, since some sources are essentially composed of user-generated content, we need to be aware of how reliable an information source can be. This process should be carefully performed, since there can be overlapping information (e.g. name or address differing). In order to simplify this process, each data source extractor should supply a confidence level for the target source (depending on whether the source is official, crowd-sourced or curated). Based on this value, we can find out which resource presents more reliable content, and that resource's original information is preserved.
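A minimal sketch of the two duplicate-candidate rules follows. The paper relies on a dedicated string-similarity library [6], so difflib's ratio is used here only as a self-contained stand-in, and the distance between the two POIs is assumed to be precomputed:

```python
from difflib import SequenceMatcher

# Stand-in similarity; the real system uses Jaro-family metrics from [6].
def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_site_match(sites_a, sites_b):
    # Highest similarity between any pair of websites from the two lists.
    return max((similarity(a, b) for a in sites_a for b in sites_b), default=0.0)

def duplicate_candidates(poi_a, poi_b, distance_m):
    if distance_m >= 100:                          # both rules require < 100 m
        return False
    name_sim = similarity(poi_a["name"], poi_b["name"])
    if not poi_a["websites"] or not poi_b["websites"]:
        return name_sim > 0.9                      # rule 1: no website available
    site_sim = best_site_match(poi_a["websites"], poi_b["websites"])
    return name_sim > 0.7 and site_sim > 0.5       # rule 2: websites comparable

a = {"name": "Restaurante La Rúcula", "websites": ["larucula.com.pt"]}
b = {"name": "Restaurante La Rúcula", "websites": []}
print(duplicate_candidates(a, b, 40))  # → True (rule 1: same name, no site)
```

Candidates flagged this way would still go through the confidence-based merge described above.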
3.3 Information Extraction

After merging the information provided by the enrichment sources, the retrieved textual contents are submitted to a term extraction pipeline, a parallel process comprising two main tasks: named entity recognition [8] on one side and noun-phrase chunking [9] on the other. The latter is achieved by splitting sentences into tokens (like words and punctuation marks), tagging these tokens using a POS tagger [10] and subsequently using a chunker to extract phrase chunks marked as noun-phrases, while the named entity recognition task aims to extract entities such as persons, places or organizations from raw text. These tasks are applied to the textual contents in order to extract simple and compound terms, which can be common or proper nouns.

Most of the tools used are provided by Apache OpenNLP. The tokenizer and POS tagger models for the Portuguese language were available at OpenNLP's website, but the named entity recognizer model had to be trained by us using

a Portuguese corpus, Amazônia [11], provided by Linguateca. The chunker used was previously developed for a CISUC participation in Págico [12], and makes use of rules extracted from another Portuguese resource, Bosque [13]. This pipeline results in two term lists, both submitted to a term validation process.

Term validation is achieved by defining a set of rules with which extracted terms should comply. These rules are mostly concerned with the target language, focusing on aspects like stopword filtering, unusual character detection, word size and number of words per term. To create the rule set, we had to define what a term is: a set of words that can clearly describe important features of a place, such as a restaurant type, a particular dish or even a short description like "Happy hours". To infer whether a term is valid, we start with the simplest form of a term: the word. We use a stopword list to discard single-word terms appearing in it. Besides, a valid word usually contains three or more letters, has at least one vowel, should not have camel-case syntax (e.g. "casa") and should not contain unusual punctuation marks. Words related to the POI title or address are discarded, since such terms will not bring any new knowledge about the POI itself.

Stepping up a level, a term must contain at most 3 words. To validate this metric, we analyzed all the titles of the Portuguese version of Wikipedia (which contains up to 1 million terms) in order to compute the word count frequency. As shown by the histogram in Figure 2, terms with more than 3 words have a lower frequency and are usually found in place or person names and similar situations, being too specific to describe a set of resources (e.g. "Aeroporto Francisco Sá Carneiro", "Diana, princesa de Gales").

[Fig. 2. Term Size Frequencies on Portuguese Wikipedia: histogram of frequency versus number of words per title.]
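The validation rules above can be sketched as follows; the stopword list is a tiny illustrative sample and the exact character checks are assumptions, not the system's actual rule set:

```python
import re

# Sketch of the term-validation rules: stopword filtering, minimum word shape
# (3+ letters, at least one vowel, no camel case, no odd punctuation) and a
# 3-word ceiling per term.

STOPWORDS = {"a", "o", "de", "da", "do", "e", "em", "para", "com"}
VOWELS = set("aeiouáàâãéêíóôõú")

def valid_word(word):
    if word.lower() in STOPWORDS:
        return False
    if len(word) < 3 or not (set(word.lower()) & VOWELS):
        return False
    if re.search(r"[a-záàâãéêíóôõúç][A-Z]", word):   # camel-case syntax
        return False
    # Only letters, digits or hyphens; no unusual punctuation.
    return re.fullmatch(r"[\wáàâãéêíóôõúç-]+", word) is not None

def valid_term(term):
    words = term.split()
    if not 1 <= len(words) <= 3:      # terms above 3 words are too specific
        return False
    if len(words) == 1:
        return valid_word(words[0])   # a single-word term may not be a stopword
    return all(w.lower() in STOPWORDS or valid_word(w) for w in words)

terms = ["comida italiana", "de", "Aeroporto Francisco Sá Carneiro"]
print([t for t in terms if valid_term(t)])  # → ['comida italiana']
```

The 4-word airport name is rejected by the word-count ceiling, matching the Wikipedia-title statistics that motivated it.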

3.4 Term Contextualization

Once the terms are recognized as correct linguistic forms, we proceed to infer whether they are meaningful and belong to the Portuguese lexicon, using knowledge bases like PAPEL (a lexical resource for natural language processing of Portuguese which contains a large set of lexical relations extracted from an extensive Portuguese general dictionary) [14], Wiktionary (a collaborative project to produce a free-content multilingual dictionary) and Wikipedia. The usage of Wikipedia as a knowledge source has been widely adopted in recent years, well justified by some authors [8].

The first step is to decide which knowledge source to use first, verifying whether the term is single (composed of a single word, possibly hyphenated) or compound (composed of multiple separate words). If the term fits in the first group, we make use of PAPEL, since it is available as a single file and consequently provides faster access (being mapped to native structures with no HTTP communication involved). It is also the primary choice for single terms since its contents were extracted from a general dictionary, which mostly contains single terms. Otherwise, if the term is compound, we perform an open search query on Wikipedia using the default MediaWiki API. If the search query provides satisfactory results, as computed by the string similarity between the search query and the given results, we can assume that the term exists in the Portuguese language. If no good results are returned, the term is transformed by progressively reducing its number of words, in order to achieve a simplification of the term. This is done by generating all the possible combinations of the term obtained by removing words on the left and on the right, resulting in a list of terms sorted in descending order by term size. We continue to query Wikipedia with the shortened terms until the term reaches its unitary form.
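The progressive shortening just described (all left/right trims, longest first) can be sketched as:

```python
# Sketch: every contiguous sub-term obtained by dropping words from the left
# and/or right of the original term, ordered longest first, down to single words.

def shortened_terms(term):
    words = term.split()
    n = len(words)
    return [" ".join(words[i:i + length])
            for length in range(n - 1, 0, -1)      # longest sub-terms first
            for i in range(n - length + 1)]

print(shortened_terms("restaurante de comida italiana"))
# → ['restaurante de comida', 'de comida italiana', 'restaurante de',
#    'de comida', 'comida italiana', 'restaurante', 'de', 'comida', 'italiana']
```

Each sub-term would then be queried against Wikipedia in this order, stopping at the first satisfactory match.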
Then, we make use of PAPEL or, as a last resort, Wiktionary. At the end of this process, the terms found valid are preserved and the others discarded.

3.5 Term Weighting

In order to evaluate how relevant a term can be in representing a particular place, we calculate each term's weight using TF-IDF. To obtain these values, we go through the contents related to the place which contain the term in order to retrieve the term frequency. Since terms can be compound, we cannot simply count the occurrences of each term and divide by the number of tokens present in each POI's contents. Instead, we split the term into single-word terms and calculate the mean value of all their frequencies. This calculation can be biased by the stopwords a term may contain; to solve this issue, we discard the frequencies computed for stopwords. To obtain the inverse document frequency, we access all the available documents (the contents contained in the POIs already stored) in order to count the documents where the term

occurs and divide the total number of documents analyzed by this value. Having these two values, we multiply them to obtain the term weight. If there are no textual descriptions available, we can obtain such resources by generalizing our search, using the name of the POI or even its corresponding category. Searching for these attributes and obtaining the first paragraphs of the corresponding Wikipedia pages gives us brand new content to deal with, this data being stored as descriptions and properly mined in order to extract relevant terms from it. By the end of the whole process, we obtain a list of POIs with more information than was provided, along with a ranked list of representative concepts for each one of them.

3.6 Platform Interoperability

All platform operations are available through a RESTful (Representational State Transfer) API, allowing external services of TICE.Mobilidade to interact with the semantic enrichment module. The data collected in the process can be accessed in a simple way, through methods such as extracting the most relevant terms within an area (by specifying a bounding box of geographical coordinates) or enriching a particular set of resources. For data storage operations, a PostGIS spatially-enabled PostgreSQL database is being used temporarily, soon to be replaced by the Semantic Interoperability Module, which uses an ontology and RDF triples to store all the knowledge data provided by upper-level modules.

4 Experiments and Results

This section focuses on the experiments performed for each feature of the system, presenting their results in order to validate the proposed approach.

4.1 The Experimental Dataset

Focusing our experiment on Portuguese territory, we collected points of interest from four different data sources, covering the metropolitan area of Lisbon. The sources used comprise two social networks (Facebook and Foursquare) and two commercial directories (Factual and Lifecooler), in order to diversify the content types.
The data was collected using the extractor pipeline described earlier (Section 3.1). Since these extractors only collect data within a given location and radius, we created a bounding box around the whole area and split it into a grid, in order to collect the resources contained in each cell.
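The grid-based collection can be sketched as follows; the bounding box and grid size below are illustrative, not the values used in the experiment:

```python
# Sketch: split a bounding box into a rows x cols grid and yield each cell's
# centre, which would then be queried through the extractor pipeline with a
# radius large enough to cover the cell.

def grid_cells(south, west, north, east, rows, cols):
    """Yield (lat, lon) centres of a rows x cols grid over the bounding box."""
    dlat = (north - south) / rows
    dlon = (east - west) / cols
    for r in range(rows):
        for c in range(cols):
            yield (south + (r + 0.5) * dlat, west + (c + 0.5) * dlon)

# Rough illustrative box around Lisbon, split into a 4 x 4 grid:
cells = list(grid_cells(38.6, -9.3, 38.8, -9.0, 4, 4))
print(len(cells), cells[0])  # 16 (38.625, -9.2625)
```

Querying cell centres with a per-cell radius trades some overlap between neighbouring queries for guaranteed coverage of the whole box.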

The results of this mass extraction comprise place representations with associated textual descriptions among 766 different categories (with no taxonomy merge). The average description size is around 320 characters, with a high standard deviation (around 400), which demonstrates the asymmetry of the extracted descriptions, with sizes varying from 10 to characters.

4.2 Resource Integration

In order to measure the results of the Integration module (Section 3.2), the stored resources were analyzed to detect possible duplicates. Each resource was fetched and all the surrounding points of interest within a 100-meter radius were analyzed to check whether there were any duplicates. If duplicates were found, a duplicate set was created, with the current POI as its first element followed by the duplicates. By the end of the process, a list of duplicate sets is obtained. Table 1 shows the current POI in the first row and the duplicate candidates below it.

Table 1. Simplified example of a duplicate set

Name | Address | Website | Distance (m) | Categories
Restaurante La Rúcula | Rua Rossio Olivais | larucula.com.pt | | [Food & Beverage, Restaurants]
Restaurante La Rúcula | Rua Rossio Olivais | | | [Local business]
La Rúcula | Rossio dos Olivais | | | [Italian Restaurant]

Since there is no ground truth available to validate this process in a straightforward way, the duplicate set list was split into smaller parts among eight spreadsheets, each one containing around 300 duplicate sets (each comprising a main POI followed by its possible duplicates). To make the validation task easier, each duplicate set contained the most discriminative attributes of each duplicate candidate: name, address, distance from the original POI, websites and categories. The spreadsheets were handed to volunteers from different backgrounds, in order to label the duplicates which were correctly detected. This way, the labeled results correspond to true positives and those left blank to false positives.
These values allow us to calculate the overall precision of the approach. The precision values obtained vary from 52% to 98%, with a micro-average precision of 88% (σ = 0.14) and a macro-average precision of 92%. The most common situation where this approach produces false positives is related to places that are contained in other places, like a particular store in a shopping mall or a restaurant inside a hotel, as can be observed in Table 2. The dataset with the lowest precision was smaller than the rest and had multiple cases like this. These false positives can be avoided using specific rules based on regular expressions and by normalizing the categories across all sources. But each source presents a different category taxonomy and the category representation language differs from source to source, which makes the task more

complex. The temporary solution is to normalize the categories of each source into a single taxonomy during the extraction process. This approach has a downside, since some enrichment sources have a poor place category distribution, like Facebook, where most of the places relate to a generic category: Local Business. The variation between each POI's location (seen in Table 2) can be justified by the origin of the POIs: being inserted by users (most probably on a mobile device), the location is not always exact.

Table 2. The first duplicate set represents a true positive situation; in the second one, the last two set members represent a false positive and a true positive, respectively.

Name | Address | Website | Distance (m) | Categories
IBM Portugal | Rua do Mar da China, Lote | | | [Office]
Companhia IBM Portuguesa | Rua Mar da China, Lt | ibm.com/pt | | [Business & Professional Services, Equipment, Supplies & Services, Office]
Altis Belem Hotel & Spa, Lisboa | Doca do Bom Sucesso | | | [Local business]
Restaurante Feitoria - Altis Belém Hotel & SPA | Doca do Bom Sucesso | | | [Local business]
Altis Belém Hotel & SPA | Doca do Bom Sucesso | | | [Hotel]

4.3 Term Extraction

So far, around terms have been extracted from the place descriptions present in this dataset, 4939 of which are distinct. A few examples containing the POI name, category and related terms are presented in Table 3, ordered by the respective weight obtained by computing the TF-IDF value for each one. In order to validate the approach, the term extraction process was split into two different evaluations: term coherence, evaluating the extracted terms per POI and verifying whether each term fits the given POI's representation; and relevance computing, evaluating whether the list of terms produced for a certain POI is correctly ordered by term relevance. The validation process was likewise performed with the help of volunteers from different areas (mostly students).
In the term coherence task, volunteers were asked to classify a set of POIs and the extracted terms into 3 categories in order to infer how coherent the extracted terms were: 1 - Less relevant; 2 - Somewhat relevant; 3 - Much relevant. The validation set was composed of 3279 POIs with extracted terms, split among multiple spreadsheets so as to be equally distributed among the volunteers.

At the end of the validation process, around 27% of the term groups were classified as less relevant, 17% as somewhat relevant and 56% as much relevant. In the relevance computing task, a binary question was asked of the volunteers: "Is the following list of terms properly ordered by its relevance regarding this place?". The validation dataset was derived from the previous task, comprising 1759 POIs (POIs with only one extracted term were not considered, nor those classified as less relevant). This dataset was split into multiple spreadsheets which were delivered to the same volunteers as in the term coherence task, with 73% of the answers being "yes" and 27% "no". The results of both tasks were found satisfactory, and once again some patterns were found in the worst results: in the term coherence task, words that do not bring added value to a place description were frequent, even though they were not stopwords. This issue can be overcome by using a dynamic stopword list, updated when terms with low relevance show up.

Table 3. Example of the Extracted Terms per POI

Name | Categories | Terms
Palavra de Viajante - livros e acessórios de viagem | [Turismo de Compras, Livrarias] | [viagens, mapas, guias, cidades]
AEFCSH - UNL | [Organization] | [associação de estudantes, faculdade de ciências]
Restaurante Osaka | [Noite e Restaurantes, Restaurantes] | [sushi, tempura, sashimi]
Indie Rock Café | [Noite e Restaurantes, Bares e Discotecas] | [sex pistols, ramones, madness]

5 Conclusions and Further Work

In this paper we presented our approach to creating a semantic enrichment platform for the Portuguese language. Key points of the approach were validated through some experiments: place integration reached a satisfactory precision of 88 percent, with the weaknesses of this approach identified, allowing us to improve it; and term extraction provided some preliminary results, which need to be tuned in order to assure the extraction of meaningful terms that can accurately define a particular place.
The term extraction module also produced good results, extracting meaningful terms around 70 percent of the time, and the ordering by relevance was found correct for 73 percent of the validation dataset. Future work includes improvements to the results of place integration and to the term extraction pipeline. Since we were able to detect the biggest source of false positives in the place integration module, the path is now clear to produce better results. Regarding the term extraction module, some rules for term acceptance need to be revised, since part of the extracted terms did not comply with our expectations. The validation method should also be streamlined in order to reduce the human volunteer effort. The platform integration with TICE.Mobilidade has

already started: since every component was built in a modular way, the integration process became much simpler.

References

1. Alves, A., Pereira, F., Rodrigues, F., Oliveirinha, J.: Place in perspective: Extracting online information about points of interest. In: Ambient Intelligence. Volume 6439 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg (2010)
2. Dearman, D., Truong, K.N.: Identifying the activities supported by locations with community-authored content. In: Proc. of the 12th ACM Int. Conference on Ubiquitous Computing (Ubicomp '10), New York, NY, USA, ACM (2010)
3. Alazzawi, A.N., Abdelmoty, A.I., Jones, C.B.: What can I do there? Towards the automatic discovery of place-related services and activities. Int. Journal of Geographical Information Science 26(2) (2012)
4. Cano, E., Burel, G., Dadzie, A.S., Ciravegna, F.: Topica: A tool for visualising emerging semantics of POIs based on social awareness streams. In: 10th Int. Semantic Web Conf. (ISWC 2011), Demo Track (2011)
5. Butuc, M.G.: Semantically enriching content using OpenCalais. Interface 9(2) (2009)
6. Cohen, W., Ravikumar, P., Fienberg, S.: A comparison of string metrics for matching names and records. In: ACM International Conference on Knowledge Discovery and Data Mining (KDD), Workshop on Data Cleaning, Record Linkage, and Object Consolidation, Citeseer (2003)
7. Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84(406) (1989)
8. Ferreira, L., Teixeira, A., da Silva Cunha, J.P.: REMMA - Reconhecimento de entidades mencionadas do MedAlert. Linguateca (31 December 2008)
9. Milidiú, R.L., Santos, C.N., Duarte, J.C.: Phrase chunking using entropy guided transformation. In: Proc. of ACL-08: HLT (2008)
10. Nogueira dos Santos, C., Milidiú, R., Rentería, R.: Portuguese part-of-speech tagging using entropy guided transformation learning.
In: Computational Processing of the Portuguese Language. Volume 5190 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg (2008)
11. Freitas, C., Santos, D.: Blogs, Amazônia e a Floresta Sintá(c)tica: um corpus de um novo gênero? In: Atas do ELC2010 (2012)
12. Rodrigues, R., Gonçalo Oliveira, H., Gomes, P.: Uma abordagem ao Págico baseada no processamento e análise de sintagmas dos tópicos. Linguamática 4(1) (April 2012)
13. Freitas, C., Rocha, P., Bick, E.: Floresta Sintá(c)tica: Bigger, Thicker and Easier. In: Computational Processing of the Portuguese Language, 8th International Conference, Proceedings (PROPOR 2008). Springer Verlag (8-10 September 2008)
14. Gonçalo Oliveira, H., Santos, D., Gomes, P., Seco, N.: PAPEL: a dictionary-based lexical ontology for Portuguese. In: Computational Processing of the Portuguese Language, 8th International Conference, Proceedings (PROPOR 2008). Springer Verlag (8-10 September 2008)


More information

3.4 Data-Centric workflow

3.4 Data-Centric workflow 3.4 Data-Centric workflow One of the most important activities in a S-DWH environment is represented by data integration of different and heterogeneous sources. The process of extract, transform, and load

More information

Data and Information Integration: Information Extraction

Data and Information Integration: Information Extraction International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Data and Information Integration: Information Extraction Varnica Verma 1 1 (Department of Computer Science Engineering, Guru Nanak

More information

The Goal of this Document. Where to Start?

The Goal of this Document. Where to Start? A QUICK INTRODUCTION TO THE SEMILAR APPLICATION Mihai Lintean, Rajendra Banjade, and Vasile Rus vrus@memphis.edu linteam@gmail.com rbanjade@memphis.edu The Goal of this Document This document introduce

More information

Ontology Matching with CIDER: Evaluation Report for the OAEI 2008

Ontology Matching with CIDER: Evaluation Report for the OAEI 2008 Ontology Matching with CIDER: Evaluation Report for the OAEI 2008 Jorge Gracia, Eduardo Mena IIS Department, University of Zaragoza, Spain {jogracia,emena}@unizar.es Abstract. Ontology matching, the task

More information

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets Arjumand Younus 1,2, Colm O Riordan 1, and Gabriella Pasi 2 1 Computational Intelligence Research Group,

More information

Tools and Infrastructure for Supporting Enterprise Knowledge Graphs

Tools and Infrastructure for Supporting Enterprise Knowledge Graphs Tools and Infrastructure for Supporting Enterprise Knowledge Graphs Sumit Bhatia, Nidhi Rajshree, Anshu Jain, and Nitish Aggarwal IBM Research sumitbhatia@in.ibm.com, {nidhi.rajshree,anshu.n.jain}@us.ibm.com,nitish.aggarwal@ibm.com

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

A Linguistic Approach for Semantic Web Service Discovery

A Linguistic Approach for Semantic Web Service Discovery A Linguistic Approach for Semantic Web Service Discovery Jordy Sangers 307370js jordysangers@hotmail.com Bachelor Thesis Economics and Informatics Erasmus School of Economics Erasmus University Rotterdam

More information

Linking Entities in Chinese Queries to Knowledge Graph

Linking Entities in Chinese Queries to Knowledge Graph Linking Entities in Chinese Queries to Knowledge Graph Jun Li 1, Jinxian Pan 2, Chen Ye 1, Yong Huang 1, Danlu Wen 1, and Zhichun Wang 1(B) 1 Beijing Normal University, Beijing, China zcwang@bnu.edu.cn

More information

Extraction of Web Image Information: Semantic or Visual Cues?

Extraction of Web Image Information: Semantic or Visual Cues? Extraction of Web Image Information: Semantic or Visual Cues? Georgina Tryfou and Nicolas Tsapatsoulis Cyprus University of Technology, Department of Communication and Internet Studies, Limassol, Cyprus

More information

PRIOR System: Results for OAEI 2006

PRIOR System: Results for OAEI 2006 PRIOR System: Results for OAEI 2006 Ming Mao, Yefei Peng University of Pittsburgh, Pittsburgh, PA, USA {mingmao,ypeng}@mail.sis.pitt.edu Abstract. This paper summarizes the results of PRIOR system, which

More information

Oleksandr Kuzomin, Bohdan Tkachenko

Oleksandr Kuzomin, Bohdan Tkachenko International Journal "Information Technologies Knowledge" Volume 9, Number 2, 2015 131 INTELLECTUAL SEARCH ENGINE OF ADEQUATE INFORMATION IN INTERNET FOR CREATING DATABASES AND KNOWLEDGE BASES Oleksandr

More information

Annotation Component in KiWi

Annotation Component in KiWi Annotation Component in KiWi Marek Schmidt and Pavel Smrž Faculty of Information Technology Brno University of Technology Božetěchova 2, 612 66 Brno, Czech Republic E-mail: {ischmidt,smrz}@fit.vutbr.cz

More information

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion Sara Lana-Serrano 1,3, Julio Villena-Román 2,3, José C. González-Cristóbal 1,3 1 Universidad Politécnica de Madrid 2 Universidad

More information

NLP Final Project Fall 2015, Due Friday, December 18

NLP Final Project Fall 2015, Due Friday, December 18 NLP Final Project Fall 2015, Due Friday, December 18 For the final project, everyone is required to do some sentiment classification and then choose one of the other three types of projects: annotation,

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Towards the Automatic Creation of a Wordnet from a Term-based Lexical Network

Towards the Automatic Creation of a Wordnet from a Term-based Lexical Network Towards the Automatic Creation of a Wordnet from a Term-based Lexical Network Hugo Gonçalo Oliveira, Paulo Gomes (hroliv,pgomes)@dei.uc.pt Cognitive & Media Systems Group CISUC, University of Coimbra Uppsala,

More information

EFFICIENT INTEGRATION OF SEMANTIC TECHNOLOGIES FOR PROFESSIONAL IMAGE ANNOTATION AND SEARCH

EFFICIENT INTEGRATION OF SEMANTIC TECHNOLOGIES FOR PROFESSIONAL IMAGE ANNOTATION AND SEARCH EFFICIENT INTEGRATION OF SEMANTIC TECHNOLOGIES FOR PROFESSIONAL IMAGE ANNOTATION AND SEARCH Andreas Walter FZI Forschungszentrum Informatik, Haid-und-Neu-Straße 10-14, 76131 Karlsruhe, Germany, awalter@fzi.de

More information

Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids. Marek Lipczak Arash Koushkestani Evangelos Milios

Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids. Marek Lipczak Arash Koushkestani Evangelos Milios Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids Marek Lipczak Arash Koushkestani Evangelos Milios Problem definition The goal of Entity Recognition and Disambiguation

More information

Information Extraction Techniques in Terrorism Surveillance

Information Extraction Techniques in Terrorism Surveillance Information Extraction Techniques in Terrorism Surveillance Roman Tekhov Abstract. The article gives a brief overview of what information extraction is and how it might be used for the purposes of counter-terrorism

More information

NLP Chain. Giuseppe Castellucci Web Mining & Retrieval a.a. 2013/2014

NLP Chain. Giuseppe Castellucci Web Mining & Retrieval a.a. 2013/2014 NLP Chain Giuseppe Castellucci castellucci@ing.uniroma2.it Web Mining & Retrieval a.a. 2013/2014 Outline NLP chains RevNLT Exercise NLP chain Automatic analysis of texts At different levels Token Morphological

More information

Final Project Discussion. Adam Meyers Montclair State University

Final Project Discussion. Adam Meyers Montclair State University Final Project Discussion Adam Meyers Montclair State University Summary Project Timeline Project Format Details/Examples for Different Project Types Linguistic Resource Projects: Annotation, Lexicons,...

More information

3 Publishing Technique

3 Publishing Technique Publishing Tool 32 3 Publishing Technique As discussed in Chapter 2, annotations can be extracted from audio, text, and visual features. The extraction of text features from the audio layer is the approach

More information

Semantic Web Company. PoolParty - Server. PoolParty - Technical White Paper.

Semantic Web Company. PoolParty - Server. PoolParty - Technical White Paper. Semantic Web Company PoolParty - Server PoolParty - Technical White Paper http://www.poolparty.biz Table of Contents Introduction... 3 PoolParty Technical Overview... 3 PoolParty Components Overview...

More information

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Tomáš Kramár, Michal Barla and Mária Bieliková Faculty of Informatics and Information Technology Slovak University

More information

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Dipak J Kakade, Nilesh P Sable Department of Computer Engineering, JSPM S Imperial College of Engg. And Research,

More information

@Note2 tutorial. Hugo Costa Ruben Rodrigues Miguel Rocha

@Note2 tutorial. Hugo Costa Ruben Rodrigues Miguel Rocha @Note2 tutorial Hugo Costa (hcosta@silicolife.com) Ruben Rodrigues (pg25227@alunos.uminho.pt) Miguel Rocha (mrocha@di.uminho.pt) 23-01-2018 The document presents a typical workflow using @Note2 platform

More information

Unstructured Data. CS102 Winter 2019

Unstructured Data. CS102 Winter 2019 Winter 2019 Big Data Tools and Techniques Basic Data Manipulation and Analysis Performing well-defined computations or asking well-defined questions ( queries ) Data Mining Looking for patterns in data

More information

Query Difficulty Prediction for Contextual Image Retrieval

Query Difficulty Prediction for Contextual Image Retrieval Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.

More information

A fully-automatic approach to answer geographic queries: GIRSA-WP at GikiP

A fully-automatic approach to answer geographic queries: GIRSA-WP at GikiP A fully-automatic approach to answer geographic queries: at GikiP Johannes Leveling Sven Hartrumpf Intelligent Information and Communication Systems (IICS) University of Hagen (FernUniversität in Hagen)

More information

Document Retrieval using Predication Similarity

Document Retrieval using Predication Similarity Document Retrieval using Predication Similarity Kalpa Gunaratna 1 Kno.e.sis Center, Wright State University, Dayton, OH 45435 USA kalpa@knoesis.org Abstract. Document retrieval has been an important research

More information

Enhancing applications with Cognitive APIs IBM Corporation

Enhancing applications with Cognitive APIs IBM Corporation Enhancing applications with Cognitive APIs After you complete this section, you should understand: The Watson Developer Cloud offerings and APIs The benefits of commonly used Cognitive services 2 Watson

More information

Customisable Curation Workflows in Argo

Customisable Curation Workflows in Argo Customisable Curation Workflows in Argo Rafal Rak*, Riza Batista-Navarro, Andrew Rowley, Jacob Carter and Sophia Ananiadou National Centre for Text Mining, University of Manchester, UK *Corresponding author:

More information

Semantic Multimedia Information Retrieval Based on Contextual Descriptions

Semantic Multimedia Information Retrieval Based on Contextual Descriptions Semantic Multimedia Information Retrieval Based on Contextual Descriptions Nadine Steinmetz and Harald Sack Hasso Plattner Institute for Software Systems Engineering, Potsdam, Germany, nadine.steinmetz@hpi.uni-potsdam.de,

More information

A Comprehensive Analysis of using Semantic Information in Text Categorization

A Comprehensive Analysis of using Semantic Information in Text Categorization A Comprehensive Analysis of using Semantic Information in Text Categorization Kerem Çelik Department of Computer Engineering Boğaziçi University Istanbul, Turkey celikerem@gmail.com Tunga Güngör Department

More information

Enhanced retrieval using semantic technologies:

Enhanced retrieval using semantic technologies: Enhanced retrieval using semantic technologies: Ontology based retrieval as a new search paradigm? - Considerations based on new projects at the Bavarian State Library Dr. Berthold Gillitzer 28. Mai 2008

More information

GIR experiements with Forostar at GeoCLEF 2007

GIR experiements with Forostar at GeoCLEF 2007 GIR experiements with Forostar at GeoCLEF 2007 Simon Overell 1, João Magalhães 1 and Stefan Rüger 2,1 1 Multimedia & Information Systems Department of Computing, Imperial College London, SW7 2AZ, UK 2

More information

Creating Large-scale Training and Test Corpora for Extracting Structured Data from the Web

Creating Large-scale Training and Test Corpora for Extracting Structured Data from the Web Creating Large-scale Training and Test Corpora for Extracting Structured Data from the Web Robert Meusel and Heiko Paulheim University of Mannheim, Germany Data and Web Science Group {robert,heiko}@informatik.uni-mannheim.de

More information

Falcon-AO: Aligning Ontologies with Falcon

Falcon-AO: Aligning Ontologies with Falcon Falcon-AO: Aligning Ontologies with Falcon Ningsheng Jian, Wei Hu, Gong Cheng, Yuzhong Qu Department of Computer Science and Engineering Southeast University Nanjing 210096, P. R. China {nsjian, whu, gcheng,

More information

Query Expansion using Wikipedia and DBpedia

Query Expansion using Wikipedia and DBpedia Query Expansion using Wikipedia and DBpedia Nitish Aggarwal and Paul Buitelaar Unit for Natural Language Processing, Digital Enterprise Research Institute, National University of Ireland, Galway firstname.lastname@deri.org

More information

Using Linked Data to Reduce Learning Latency for e-book Readers

Using Linked Data to Reduce Learning Latency for e-book Readers Using Linked Data to Reduce Learning Latency for e-book Readers Julien Robinson, Johann Stan, and Myriam Ribière Alcatel-Lucent Bell Labs France, 91620 Nozay, France, Julien.Robinson@alcatel-lucent.com

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Ontology Extraction from Heterogeneous Documents

Ontology Extraction from Heterogeneous Documents Vol.3, Issue.2, March-April. 2013 pp-985-989 ISSN: 2249-6645 Ontology Extraction from Heterogeneous Documents Kirankumar Kataraki, 1 Sumana M 2 1 IV sem M.Tech/ Department of Information Science & Engg

More information

A Semantic Web-Based Approach for Harvesting Multilingual Textual. definitions from Wikipedia to support ICD-11 revision

A Semantic Web-Based Approach for Harvesting Multilingual Textual. definitions from Wikipedia to support ICD-11 revision A Semantic Web-Based Approach for Harvesting Multilingual Textual Definitions from Wikipedia to Support ICD-11 Revision Guoqian Jiang 1,* Harold R. Solbrig 1 and Christopher G. Chute 1 1 Department of

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK REVIEW PAPER ON IMPLEMENTATION OF DOCUMENT ANNOTATION USING CONTENT AND QUERYING

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

Fast and Effective System for Name Entity Recognition on Big Data

Fast and Effective System for Name Entity Recognition on Big Data International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-3, Issue-2 E-ISSN: 2347-2693 Fast and Effective System for Name Entity Recognition on Big Data Jigyasa Nigam

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

2 Experimental Methodology and Results

2 Experimental Methodology and Results Developing Consensus Ontologies for the Semantic Web Larry M. Stephens, Aurovinda K. Gangam, and Michael N. Huhns Department of Computer Science and Engineering University of South Carolina, Columbia,

More information

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala Master Project Various Aspects of Recommender Systems May 2nd, 2017 Master project SS17 Albert-Ludwigs-Universität Freiburg Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue

More information

TIC: A Topic-based Intelligent Crawler

TIC: A Topic-based Intelligent Crawler 2011 International Conference on Information and Intelligent Computing IPCSIT vol.18 (2011) (2011) IACSIT Press, Singapore TIC: A Topic-based Intelligent Crawler Hossein Shahsavand Baghdadi and Bali Ranaivo-Malançon

More information

AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands

AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands Svetlana Stoyanchev, Hyuckchul Jung, John Chen, Srinivas Bangalore AT&T Labs Research 1 AT&T Way Bedminster NJ 07921 {sveta,hjung,jchen,srini}@research.att.com

More information

Introduction to Text Mining. Hongning Wang

Introduction to Text Mining. Hongning Wang Introduction to Text Mining Hongning Wang CS@UVa Who Am I? Hongning Wang Assistant professor in CS@UVa since August 2014 Research areas Information retrieval Data mining Machine learning CS@UVa CS6501:

More information

Web Information Retrieval using WordNet

Web Information Retrieval using WordNet Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT

More information

Semantic Web. Ontology Alignment. Morteza Amini. Sharif University of Technology Fall 95-96

Semantic Web. Ontology Alignment. Morteza Amini. Sharif University of Technology Fall 95-96 ه عا ی Semantic Web Ontology Alignment Morteza Amini Sharif University of Technology Fall 95-96 Outline The Problem of Ontologies Ontology Heterogeneity Ontology Alignment Overall Process Similarity (Matching)

More information

Esfinge (Sphinx) at CLEF 2008: Experimenting with answer retrieval patterns. Can they help?

Esfinge (Sphinx) at CLEF 2008: Experimenting with answer retrieval patterns. Can they help? Esfinge (Sphinx) at CLEF 2008: Experimenting with answer retrieval patterns. Can they help? Luís Fernando Costa 1 Outline Introduction Architecture of Esfinge Answer retrieval patterns Results Conclusions

More information

Creating and Maintaining Vocabularies

Creating and Maintaining Vocabularies CHAPTER 7 This information is intended for the one or more business administrators, who are responsible for creating and maintaining the Pulse and Restricted Vocabularies. These topics describe the Pulse

More information

ALIN Results for OAEI 2016

ALIN Results for OAEI 2016 ALIN Results for OAEI 2016 Jomar da Silva, Fernanda Araujo Baião and Kate Revoredo Department of Applied Informatics Federal University of the State of Rio de Janeiro (UNIRIO), Rio de Janeiro, Brazil {jomar.silva,fernanda.baiao,katerevoredo}@uniriotec.br

More information

Ranking Web Pages by Associating Keywords with Locations

Ranking Web Pages by Associating Keywords with Locations Ranking Web Pages by Associating Keywords with Locations Peiquan Jin, Xiaoxiang Zhang, Qingqing Zhang, Sheng Lin, and Lihua Yue University of Science and Technology of China, 230027, Hefei, China jpq@ustc.edu.cn

More information

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily

More information

ImgSeek: Capturing User s Intent For Internet Image Search

ImgSeek: Capturing User s Intent For Internet Image Search ImgSeek: Capturing User s Intent For Internet Image Search Abstract - Internet image search engines (e.g. Bing Image Search) frequently lean on adjacent text features. It is difficult for them to illustrate

More information

Application of Individualized Service System for Scientific and Technical Literature In Colleges and Universities

Application of Individualized Service System for Scientific and Technical Literature In Colleges and Universities Journal of Applied Science and Engineering Innovation, Vol.6, No.1, 2019, pp.26-30 ISSN (Print): 2331-9062 ISSN (Online): 2331-9070 Application of Individualized Service System for Scientific and Technical

More information

An Adaptive Framework for Named Entity Combination

An Adaptive Framework for Named Entity Combination An Adaptive Framework for Named Entity Combination Bogdan Sacaleanu 1, Günter Neumann 2 1 IMC AG, 2 DFKI GmbH 1 New Business Department, 2 Language Technology Department Saarbrücken, Germany E-mail: Bogdan.Sacaleanu@im-c.de,

More information

Unsupervised Keyword Extraction from Single Document. Swagata Duari Aditya Gupta Vasudha Bhatnagar

Unsupervised Keyword Extraction from Single Document. Swagata Duari Aditya Gupta Vasudha Bhatnagar Unsupervised Keyword Extraction from Single Document Swagata Duari Aditya Gupta Vasudha Bhatnagar Presentation Outline Introduction and Motivation Statistical Methods for Automatic Keyword Extraction Graph-based

More information

Lecture Video Indexing and Retrieval Using Topic Keywords

Lecture Video Indexing and Retrieval Using Topic Keywords Lecture Video Indexing and Retrieval Using Topic Keywords B. J. Sandesh, Saurabha Jirgi, S. Vidya, Prakash Eljer, Gowri Srinivasa International Science Index, Computer and Information Engineering waset.org/publication/10007915

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

A Tweet Classification Model Based on Dynamic and Static Component Topic Vectors

A Tweet Classification Model Based on Dynamic and Static Component Topic Vectors A Tweet Classification Model Based on Dynamic and Static Component Topic Vectors Parma Nand, Rivindu Perera, and Gisela Klette School of Computer and Mathematical Science Auckland University of Technology

More information

Information Retrieval. Information Retrieval and Web Search

Information Retrieval. Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent

More information

Columbia University High-Level Feature Detection: Parts-based Concept Detectors

Columbia University High-Level Feature Detection: Parts-based Concept Detectors TRECVID 2005 Workshop Columbia University High-Level Feature Detection: Parts-based Concept Detectors Dong-Qing Zhang, Shih-Fu Chang, Winston Hsu, Lexin Xie, Eric Zavesky Digital Video and Multimedia Lab

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval

More information

Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm

Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm ISBN 978-93-84468-0-0 Proceedings of 015 International Conference on Future Computational Technologies (ICFCT'015 Singapore, March 9-30, 015, pp. 197-03 Sense-based Information Retrieval System by using

More information

Jianyong Wang Department of Computer Science and Technology Tsinghua University

Jianyong Wang Department of Computer Science and Technology Tsinghua University Jianyong Wang Department of Computer Science and Technology Tsinghua University jianyong@tsinghua.edu.cn Joint work with Wei Shen (Tsinghua), Ping Luo (HP), and Min Wang (HP) Outline Introduction to entity

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Ontology-based Integration and Refinement of Evaluation-Committee Data from Heterogeneous Data Sources

Ontology-based Integration and Refinement of Evaluation-Committee Data from Heterogeneous Data Sources Indian Journal of Science and Technology, Vol 8(23), DOI: 10.17485/ijst/2015/v8i23/79342 September 2015 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Ontology-based Integration and Refinement of Evaluation-Committee

More information

ISSN: (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Paper / Case Study Available online at: www.ijarcsms.com

More information

(Not Too) Personalized Learning to Rank for Contextual Suggestion

(Not Too) Personalized Learning to Rank for Contextual Suggestion (Not Too) Personalized Learning to Rank for Contextual Suggestion Andrew Yates 1, Dave DeBoer 1, Hui Yang 1, Nazli Goharian 1, Steve Kunath 2, Ophir Frieder 1 1 Department of Computer Science, Georgetown

More information

Open Research Online The Open University s repository of research publications and other research outputs

Open Research Online The Open University s repository of research publications and other research outputs Open Research Online The Open University s repository of research publications and other research outputs The Smart Book Recommender: An Ontology-Driven Application for Recommending Editorial Products

More information

Text Mining. Representation of Text Documents
