
Semantic Enrichment of Places for the Portuguese Language

Jorge O. Santos 1, Ana O. Alves 1,2, Francisco C. Pereira 1, Pedro H. Abreu 1
1 CISUC, Centre for Informatics and Systems of the University of Coimbra, Portugal
jasant@student.dei.uc.pt, {ana,camara,pha}@dei.uc.pt
2 ISEC, Coimbra Institute of Engineering, Portugal
aalves@isec.pt

Abstract. A large amount of descriptive information about places is spread across the web, provided in a crowd-sourced way by services such as commercial directories or social networks. This information can lead us to classify a particular location based on the atomic entities it comprises: points of interest. However, different services provide different types of information, and there is currently no easy way to unify this kind of data into something more descriptive than the classic place representation: a pair of coordinates and a name. In this article we present a platform that is able to semantically enrich places. This is achieved by integrating multiple data sources and extracting meaningful terms to automatically label places, applying Natural Language Processing and Information Extraction techniques to contents fetched from the web and validating such terms against knowledge bases like Wikipedia or Wiktionary. The approach focuses on the Portuguese language and aims to be part of a semantic data gateway for TICE.Mobilidade, a mobility platform.

Keywords: Semantics of Place, Information Extraction, Automatic Tagging, Natural Language Processing

1 Introduction

In this paper we present our approach to developing a dynamic place enrichment platform for the Portuguese language, able to continuously gather place information from multiple sources such as location-enabled social networks or commercial directories. The approach is inspired by KUSCO[1], a system that, for the English language, takes a set of webpages containing place information as input and extracts a ranked list of concepts. This work aims to evolve KUSCO into a dynamic and robust platform, allowing the integration of different information sources and applying its methodology to the Portuguese language.

By assigning semantic annotations to places, we are able to characterize locations based on the points of interest (POIs) they comprise, a useful resource for external services such as a tour planner that chooses places to visit based on user preferences. This approach is currently being implemented as a module of TICE.Mobilidade, a project whose vision is the creation of a digital platform providing mobility services on Portuguese territory.

The module will be responsible for transforming raw information originally supplied by multiple data providers into valuable knowledge for other modules, which offer a wide range of services to the end-user. Besides the semantic enrichment service (detailed in this paper), the module comprises two other services: one responsible for storing knowledge (the Semantic Interoperability service) and another providing a recommendation system based on user habits (the Artificial Selective Attention service).

The next section aims to give the reader some essential background, describing related work in the area. The third section specifies the proposed approach, describing the key points of the semantic enrichment process as well as the architecture, where each module is described according to its contribution to the approach. Finally, we present some experiments and discuss the validation of the results obtained.

2 Background

The rising popularity of location-based social networks and other community-based data sources makes them a relevant source of place-related data, containing information such as recommendations about a particular dish at a restaurant, a show that is usually performed at a concert venue, or even transportation-related information about how to get to a particular place. Many authors have started to take advantage of such resources to create useful services by mining this type of information.

KUSCO[1] (Knowledge Unsupervised Search for populating Concepts on Ontologies), the main inspiration for the approach proposed in this paper, allows the annotation of places and events by extracting meaningful concepts from online texts and enriching them with semantic information. The system receives a POI source as input (a collection of documents, a directory or an API), which is mapped onto the conceptual data model, automatically populating a POI database. Based on the data collected, a document search about each POI is performed, exploring the World Wide Web and Wikipedia in four different ways. The web is explored through two distinct methods: open and focused. Wikipedia is used in two different ways, to locate generic and specific information about places. These search methods are named Perspectives of Semantic Enrichment. From the documents retrieved from these data sources, a bag of relevant concepts (semantic indexes) is extracted, with each concept weighted using statistical relevance metrics.

Dearman and Truong[2] used community-authored reviews from Yelp to identify the activities supported by a certain location. By harvesting review pages and mining their content with information extraction and natural language processing techniques such as tokenization and part-of-speech (POS) tagging, the authors were able to extract verb-noun pairs representing possible activities which can be performed at a particular location, like "eat pizza" or "order latte", validating the use of community-authored content to identify activities.
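As an illustration of this kind of verb-noun pair mining (not part of our own pipeline), a minimal sketch is given below. It assumes NLTK's off-the-shelf English tagger and a simple adjacency heuristic, which are illustrative choices rather than the actual tooling of [2].

```python
# Minimal sketch of verb-noun pair mining from review text, in the spirit of
# Dearman and Truong's activity extraction. Requires the NLTK data packages
# 'punkt' and 'averaged_perceptron_tagger' (nltk.download(...)).
import nltk

def activity_pairs(review: str):
    tokens = nltk.word_tokenize(review)
    tagged = nltk.pos_tag(tokens)  # [(word, Penn Treebank tag), ...]
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        # keep a verb immediately followed by a noun, e.g. "eat pizza"
        if t1.startswith("VB") and t2.startswith("NN"):
            pairs.append((w1.lower(), w2.lower()))
    return pairs

print(activity_pairs("You can eat pizza on the terrace and order latte at the bar."))
# e.g. [('eat', 'pizza'), ('order', 'latte')]
```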

A. N. Alazzawi et al.[3] proposed a methodology to identify the concepts of services offered at a particular place and the activities that can be performed there. By detecting language patterns of place affordance in POS-tagged texts from collections of service types, the authors apply these patterns to other resources present on the web in order to retrieve service and activity concepts. This way, they are able to contextualize places, giving particular relevance to the activities performed at a location (e.g. "Church" has associated concepts like "Worship", "ChristianWorship" or "HoldChristianServices").

Topica[4] is a web application that uses the Facebook Places API to retrieve the POIs in a given area, relating each one to its corresponding fan page in order to extract user comments from it. These comments are enriched using external resources such as OpenCalais[5] and Zemanta (http://www.zemanta.com/), services that are able to extract entities, keywords and related pages from a supplied text. This data is used to model each comment as a DBpedia (http://dbpedia.org/spotlight) resource, and DBpedia is queried in order to extract a set of categories from it. Each category gets a weight based on a TF-IDF (Term Frequency x Inverse Document Frequency) calculation over the contents of the current fan page and those already retrieved. Finally, the collected data is presented on a map interface, where the user can access it in order to get textual and multimedia contents.

Our approach shares some aspects with the works of Dearman and Truong and of A. N. Alazzawi et al., but, as with KUSCO, we are more focused on the meaning of each place, gathering information from a dynamic set of sources and generating terms that can be considered meaningful for a given place. The main innovations of our approach compared to KUSCO are the dynamic handling of data sources, the support for the Portuguese language and its availability through a robust platform.

3 Approach

The description of the approach is organized following the platform's architectural structure: we describe how each module works and how it relates to the others. Each module is responsible for a specific set of tasks, which are described according to the role played in the platform. Figure 1 presents an overview of the whole semantic enrichment process.

3.1 Enrichment Source Extraction

The process is triggered when a list of POIs is received, with each POI's information sent to a set of extractor scripts, one per information source. These scripts receive a text query, a pair of coordinates and a distance radius as input (e.g. ("Torre de Belém", [38.691596, -9.215998], 500)) in order to retrieve the points of interest contained in such a location. The data extracted here comprises information like descriptions, user reviews, usage statistics (measured by some social networks in terms of user interaction), contacts and even directions to a place using public transportation.

[Fig. 1. Semantic Enrichment Model: POI web data is collected over multiple enrichment sources, integrated with duplicate detection, and processed through part-of-speech tagging, noun-phrase chunking and named entity recognition; terms are contextualized and validated against PAPEL, Wikipedia and Wiktionary, and relevance computing produces the final semantic concepts.]

The existing extractors are spawned in a thread pool with the provided parameters and executed concurrently. This initial phase may yield a set of POIs, which are handed to the integration module as a standard response, using the JSON (JavaScript Object Notation) format with all the extracted data properly structured. This input/output standard (defined as an interface pattern in the system) allows us to customize these submodules, making the platform able to interact with new extractors whenever the need to use a new data source arises.
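A minimal sketch of this extractor contract is given below. The function and field names (facebook_extractor, descriptions, websites, categories) are illustrative assumptions rather than the platform's actual interface; the sketch only conveys the idea of source-specific extractors run concurrently in a thread pool and returning a common JSON shape.

```python
# Sketch of the extractor contract: every source-specific extractor takes
# (query, (lat, lon), radius_m) and returns a list of POI dicts in a shared
# JSON-like shape. Names and fields are illustrative, not the real interface.
import json
from concurrent.futures import ThreadPoolExecutor

def facebook_extractor(query, coords, radius_m):
    # ... call the source API here and map results to the common shape
    return [{"name": query, "coords": coords, "source": "facebook",
             "descriptions": [], "websites": [], "categories": []}]

EXTRACTORS = [facebook_extractor]  # one callable per enrichment source

def extract_all(query, coords, radius_m):
    with ThreadPoolExecutor(max_workers=len(EXTRACTORS)) as pool:
        futures = [pool.submit(ex, query, coords, radius_m) for ex in EXTRACTORS]
        pois = [poi for f in futures for poi in f.result()]
    return json.dumps(pois, ensure_ascii=False)

print(extract_all("Torre de Belém", [38.691596, -9.215998], 500))
```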

3.2 Resource Integration

The data fetched by the extractors is then submitted to the integration module, where each fetched POI is analyzed in order to verify whether we are dealing with the same entity. Since we are handling multiple information sources, there is a high probability of finding duplicate POIs in the process. We take advantage of such duplication to collect complementary data that did not exist in the original POI.

In order to detect duplicates, the process goes through the attributes present in each POI, checking essentially how similar the names and websites (when available) are between POIs. To compute this similarity, we make use of an open-source library[6] which provides a wide range of string similarity algorithms[7]. We consider two POIs as duplicate candidates if they fit one of the following groups (sketched in code at the end of this subsection):

- The distance between both POIs is less than 100 meters, the name similarity is above 0.9 and no website information is available in one or both POIs.
- The distance between both POIs is less than 100 meters, the name similarity is above 0.7 and the best match among the available websites is above 0.5 (since a POI can contain multiple websites, this value stands for the highest string similarity between the two lists of websites).

These thresholds were obtained as the result of an iterative process, starting with lower values and analyzing the detected duplicates to verify whether the matches were acceptable. Since the first group is based on only two attributes, we had to set its name similarity threshold to a higher level to ensure a low rate of false positives.

If a duplicate is detected and it brings new information to the older resource, both resources are merged. However, since some sources are essentially composed of user-generated content, we need to be aware of how reliable an information source can be. This process should be performed carefully, since there can be overlapping information (e.g. differing names or addresses). To simplify it, each data source extractor should supply a confidence level for the target source (depending on whether the source is official, crowd-sourced or curated). Based on this value, we can find out which resource presents the more reliable content, and that resource's original information is preserved.
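The duplicate-candidate rules above can be summarised in a short sketch such as the following, where difflib's ratio stands in for the string similarity metrics of the library cited in [6]; only the thresholds (100 meters, 0.9 and 0.7 name similarity, 0.5 website similarity) come from the text.

```python
# Sketch of the duplicate-candidate rules; the distance is assumed to be
# precomputed (e.g. by a haversine function or the spatial database).
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_website_match(sites_a, sites_b) -> float:
    # highest pairwise similarity between the two website lists
    return max((sim(a, b) for a in sites_a for b in sites_b), default=0.0)

def duplicate_candidates(poi_a, poi_b, distance_m: float) -> bool:
    if distance_m >= 100:
        return False
    name_sim = sim(poi_a["name"], poi_b["name"])
    if (not poi_a["websites"] or not poi_b["websites"]) and name_sim > 0.9:
        return True
    return name_sim > 0.7 and best_website_match(poi_a["websites"], poi_b["websites"]) > 0.5

a = {"name": "Restaurante La Rúcula", "websites": ["larucula.com.pt"]}
b = {"name": "Restaurante La Rúcula", "websites": []}
print(duplicate_candidates(a, b, 61.5))  # True
```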

3.3 Information Extraction

After merging the information provided by the enrichment sources, the retrieved textual contents are submitted to a term extraction pipeline: a parallel process comprising two main tasks, named entity recognition[8] on one side and noun-phrase chunking[9] on the other. The latter task is achieved by splitting sentences into tokens (words and punctuation marks), tagging these tokens with a POS tagger[10] and subsequently using a chunker to extract the phrase chunks marked as noun-phrases, while the named entity recognition task aims to extract entities such as persons, places or organizations from raw text. These tasks are applied to the textual contents in order to extract simple and compound terms, which can be common or proper nouns. Most of the tools used are provided by Apache OpenNLP (http://opennlp.apache.org/). The tokenizer and POS tagger models for the Portuguese language were available on OpenNLP's website (http://opennlp.sourceforge.net/models-1.5/), but the named entity recognizer model had to be trained by us using a Portuguese corpus, Amazônia[11], provided by Linguateca (http://www.linguateca.pt/). The chunker used was previously developed for a CISUC participation in Págico[12] and makes use of rules extracted from another Portuguese resource, Bosque[13].

This pipeline results in two term lists, both submitted to a term validation process. Term validation is achieved by defining a set of rules with which extracted terms should comply. These rules are mostly concerned with the target language, focusing on aspects like stopword filtering, unusual character detection, word size and the number of words per term. To create the rule set, we had to define what a term is: a set of words that can clearly describe important features of a place, such as a restaurant type, a particular dish or even a short description like "happy hours".

To infer whether a term is valid, we start with the simplest form of a term: the word. We use a stopword list (available at http://snowball.tartarus.org/algorithms/portuguese/stop.txt) to discard single-word terms appearing in it. Besides, a valid word usually contains three or more letters and at least one vowel, should not have camel-case syntax and should not contain unusual punctuation marks. Words related to the POI title or address are discarded, since such terms do not bring any new knowledge about the POI itself. Stepping up a level, a term must contain at most 3 words. To validate this limit, we analyzed all the titles of the Portuguese version of Wikipedia (which contains up to 1 million terms) in order to compute the word-count frequency. As shown by the histogram in Figure 2, terms with more than 3 words have a lower frequency and are usually found in place or person names and similar situations, being too specific to describe a set of resources (e.g. "Aeroporto Francisco Sá Carneiro", "Diana, princesa de Gales").

[Fig. 2. Term Size Frequencies on Portuguese Wikipedia: histogram of the number of words per title, with the frequency dropping sharply beyond three words.]
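A sketch of these validation rules, under our own reading of them (the word-level rules are applied to single-word terms, the three-word limit to any term), could look as follows; the regular expressions and the stopword excerpt are illustrative.

```python
import re

# Excerpt of the Snowball Portuguese stopword list referenced in the text.
STOPWORDS = {"de", "da", "do", "e", "o", "a", "que", "para", "com"}
VOWELS = set("aeiouáàâãéêíóôõú")

def valid_term(term: str, poi_name: str = "", poi_address: str = "") -> bool:
    words = term.split()
    if not words or len(words) > 3:                  # at most three words per term
        return False
    if term.lower() in (poi_name + " " + poi_address).lower():
        return False                                  # adds nothing new about the POI
    if len(words) == 1:                               # word-level rules
        w = words[0]
        if w.lower() in STOPWORDS:
            return False
        if len(w) < 3 or not (set(w.lower()) & VOWELS):
            return False
        if re.search(r"[a-z][A-Z]", w):               # camel-case syntax
            return False
        if re.search(r"[^\w\-]", w):                  # unusual punctuation marks
            return False
    return True

print(valid_term("sushi"), valid_term("de"), valid_term("associação de estudantes"))
# True False True
```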

3.4 Term Contextualization

Once the terms are recognized as correct linguistic forms, we proceed to infer whether they are meaningful and belong to the Portuguese lexicon, using knowledge bases like PAPEL (http://www.linguateca.pt/papel), a lexical resource for natural language processing of Portuguese which contains a large set of lexical relations extracted from an extensive Portuguese general dictionary[14], Wiktionary (a collaborative project to produce a free-content multilingual dictionary, http://pt.wiktionary.org) and Wikipedia (http://pt.wikipedia.org). The use of Wikipedia as a knowledge source has been widely adopted in recent years and is well justified by some authors[8].

The first step is to decide which knowledge source to use first, verifying whether the term is single (composed of a single word, possibly hyphenated) or compound (composed of multiple separate words). If the term fits the first group, we make use of PAPEL, since it is available as a single file and consequently provides faster access (it is mapped to native structures with no HTTP communication involved). It is also the primary choice for single terms because its contents were extracted from a general dictionary, which mostly contains single terms. Otherwise, if the term is compound, we perform an open search query against Wikipedia using the default MediaWiki API (http://www.mediawiki.org/wiki/api:main_page). If the search query provides satisfactory results, judged by the string similarity between the search query and the returned results, we assume that the term exists in the Portuguese language. If no good results are returned, the term is transformed by progressively reducing the number of words it comprises, in order to achieve a simplification of the term. This is done by generating all possible combinations of the term obtained by removing words on the left and on the right, resulting in a list of terms sorted in descending order by term size. We continue to query Wikipedia with the shortened terms until the term reaches its unitary form. Then we make use of PAPEL or, as a last resort, Wiktionary. At the end of this process, the terms found to be valid are preserved and the others discarded.
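The cascade just described can be sketched as follows. The opensearch call against the MediaWiki API is real, but the 0.8 similarity threshold is illustrative, and papel_has/wiktionary_has are stubs standing in for the PAPEL file and Wiktionary lookups, whose access code is not shown here.

```python
from difflib import SequenceMatcher
import requests

def wikipedia_has(term, threshold=0.8):
    # Open search against the Portuguese Wikipedia through the MediaWiki API.
    r = requests.get("https://pt.wikipedia.org/w/api.php",
                     params={"action": "opensearch", "search": term,
                             "limit": 5, "format": "json"}, timeout=10)
    titles = r.json()[1]
    return any(SequenceMatcher(None, term.lower(), t.lower()).ratio() >= threshold
               for t in titles)

def papel_has(term):       # stand-in for the lookup in the PAPEL resource file
    return False

def wiktionary_has(term):  # stand-in for the lookup in pt.wiktionary.org
    return False

def sub_terms(term):
    # All contiguous sub-terms obtained by dropping words on the left and on
    # the right, longest first (the full term itself comes first).
    words = term.split()
    spans = {" ".join(words[i:j]) for i in range(len(words))
             for j in range(i + 1, len(words) + 1)}
    return sorted(spans, key=lambda s: -len(s.split()))

def contextualize(term):
    if len(term.split()) == 1:            # single or hyphenated term: PAPEL first
        return papel_has(term) or wikipedia_has(term) or wiktionary_has(term)
    for candidate in sub_terms(term):     # compound term: shorten progressively
        if len(candidate.split()) > 1:
            if wikipedia_has(candidate):
                return True
        elif papel_has(candidate) or wiktionary_has(candidate):
            return True
    return False
```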

3.5 Term Weighting

In order to evaluate how relevant a term is for representing a particular place, we calculate each term's weight using TF-IDF. To obtain the term frequency, we go through the contents related to the place that contains the term. Since terms may be compound, we cannot simply count the occurrences of each term and divide by the number of tokens present in the POI's contents. Instead, we split the term into single-word terms and calculate the mean value of their frequencies. This calculation can be biased by the stopwords a term may contain, so we discard the frequencies of the stopwords, if computed. To obtain the inverse document frequency, we access all the available documents (the contents of the POIs already stored), count the documents in which the term occurs, and divide the total number of documents analyzed by this value. Having these two values, we multiply them to obtain the term weight.

If no textual descriptions are available, we can obtain such resources by generalizing our search, using the name of the POI or even its corresponding category. Searching with these attributes and retrieving the first paragraphs of the corresponding Wikipedia pages gives us brand-new content to work with; this data is stored as descriptions and mined in order to extract relevant terms from it. By the end of the whole process, we obtain a list of POIs with more information than was provided, as well as a ranked list of representative concepts for each one of them.
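A small sketch of this weighting, assuming whitespace tokenization and the plain ratio of document counts described above (no logarithm is mentioned in the text):

```python
STOPWORDS = {"de", "da", "do", "e", "o", "a"}     # excerpt of the stopword list

def term_frequency(term, poi_text):
    # Mean relative frequency of the term's non-stopword words in the POI's texts.
    tokens = poi_text.lower().split()
    words = [w for w in term.lower().split() if w not in STOPWORDS]
    if not tokens or not words:
        return 0.0
    return sum(tokens.count(w) / len(tokens) for w in words) / len(words)

def inverse_document_frequency(term, all_documents):
    # Ratio of stored documents to documents containing the term.
    containing = sum(1 for doc in all_documents if term.lower() in doc.lower())
    return len(all_documents) / containing if containing else 0.0

def term_weight(term, poi_text, all_documents):
    return term_frequency(term, poi_text) * inverse_document_frequency(term, all_documents)

docs = ["restaurante de sushi e sashimi no centro",
        "bar com música ao vivo",
        "livraria de viagens"]
print(term_weight("sushi", docs[0], docs))  # ~0.43 for this toy collection
```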

3.6 Platform Interoperability

All platform operations are available through a RESTful (Representational State Transfer) API, allowing external services of TICE.Mobilidade to interact with the semantic enrichment module. The data collected in the process can be accessed in a simple way, through methods such as extracting the most relevant terms within an area, by specifying a bounding box of geographical coordinates, or enriching a particular set of resources. For data storage, a PostGIS spatially-enabled PostgreSQL database is being used temporarily, soon to be replaced by the Semantic Interoperability Module, which uses an ontology and RDF triples to store all the knowledge data provided by upper-level modules.

4 Experiments and Results

This section focuses on the experiments performed for each feature of the system, presenting their results in order to validate the proposed approach.

4.1 The Experimental Dataset

Focusing our experiment on Portuguese territory, we collected points of interest from four different data sources over the metropolitan area of Lisbon. The sources used comprise two social networks, Facebook and Foursquare (http://www.facebook.com/ and https://foursquare.com/), and two commercial directories, Factual and Lifecooler (http://www.factual.com/ and http://www.lifecooler.com/), in order to diversify the content types. The data was collected using the extractor pipeline described earlier (Section 3.1). Since these extractors only collect data within a given location and radius, we created a bounding box around the whole area and split it into a grid in order to collect the resources contained in each cell (a simple sketch of this tiling is given at the end of this subsection).

The results of this mass extraction comprise 26,843 place representations with 7,941 associated textual descriptions among 766 different categories (with no taxonomy merge). The average description size is around 320 characters, with a high standard deviation (around 400), which demonstrates the asymmetry of the extracted descriptions, whose sizes vary from 10 to 10,000 characters.
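The tiling can be sketched as below; the cell size and the bounding box roughly around Lisbon are illustrative values, not the ones used in the experiment.

```python
def grid_centres(lat_min, lat_max, lon_min, lon_max, step_deg=0.01):
    # Yield the centre of every cell of a regular grid over the bounding box.
    lat = lat_min
    while lat < lat_max:
        lon = lon_min
        while lon < lon_max:
            yield (lat + step_deg / 2, lon + step_deg / 2)
            lon += step_deg
        lat += step_deg

for centre in grid_centres(38.60, 38.85, -9.25, -9.05):
    pass  # query each extractor with this centre and a radius covering the cell
```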

4.2 Resource Integration

In order to measure the results of the integration module (Section 3.2), the stored resources were analyzed to detect possible duplicates. Each resource was fetched and all the surrounding points of interest within a 100-meter radius were analyzed to check whether there were any duplicates. If duplicates were found, a duplicate set was created, with the current POI as its first element followed by the duplicates. By the end of the process, a list of duplicate sets is obtained. Table 1 shows an example, with the current POI in the first row and the duplicate candidates below it.

Table 1. Simplified example of a duplicate set (the first row is the current POI; the rows below are its duplicate candidates)

Name                   | Address             | Website         | Distance (m) | Categories
Restaurante La Rúcula  | Rua Rossio Olivais  | larucula.com.pt |              | Food & Beverage, Restaurants
Restaurante La Rúcula  | Rua Rossio Olivais  |                 | 61.53        | Local business
La Rúcula              | Rossio dos Olivais  |                 | 77.21        | Italian Restaurant

Since there is no ground truth available to validate this process in a straightforward way, the list of duplicate sets was split into smaller parts among eight spreadsheets, each containing around 300 duplicate sets (each set containing a main POI followed by its possible duplicates). To make the validation task easier, each duplicate set contained the most discriminative attributes of each duplicate candidate: name, address, distance from the original POI, websites and categories. The spreadsheets were handed to volunteers from different backgrounds in order to label the duplicates which were correctly detected. This way, the labelled results consist of true positives and those left blank of false positives. These values allow us to calculate the overall precision of the approach.

The precision values obtained vary from 52% to 98%, with a micro-average precision of 88% (σ = 0.14) and a macro-average precision of 92% (the difference between the two aggregations is illustrated after this subsection). The most common situation in which this approach produces false positives is related to places contained in other places, like a particular store in a shopping mall or a restaurant inside a hotel, as can be observed in Table 2. The spreadsheet with the lowest precision was smaller than the rest and had multiple cases like this. These false positives can be avoided using specific rules based on regular expressions and by normalizing the categories among all sources. However, each source presents a different category taxonomy and the category representation language differs from source to source, which makes the task more complex. The temporary solution will be to normalize the categories of each source into a single taxonomy during the extraction process. This approach has a downside, since some enrichment sources have a poor place category distribution, like Facebook, where most places are assigned a generic category: Local Business. The variation between each POI's location (seen in Table 2) can be explained by the origin of the POIs: being inserted by users (most probably on mobile devices), the location is not always exact.

Table 2. The first duplicate set represents a true positive situation; in the second, the last two set members represent a false positive and a true positive, respectively.

Name                                           | Address                            | Website    | Distance (m) | Categories
IBM Portugal                                   | Rua do Mar da China, Lote 1.07.2.3 |            |              | Office
Companhia IBM Portuguesa                       | Rua Mar da China, Lt. 1.07.2.3     | ibm.com/pt | 14.31        | Business & Professional Services, Equipment, Supplies & Services, Office
Altis Belem Hotel & Spa, Lisboa                | Doca do Bom Sucesso                |            |              | Local business
Restaurante Feitoria - Altis Belém Hotel & SPA | Doca do Bom Sucesso                |            | 40.84        | Local business
Altis Belém Hotel & SPA                        | Doca do Bom Sucesso                |            | 69.04        | Hotel
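For clarity, the two aggregations differ as sketched below: macro-averaging takes the plain mean of the per-spreadsheet precisions, while micro-averaging pools the labelled counts of all spreadsheets before dividing. The counts are made up for illustration, not the ones obtained in the experiment.

```python
# (true positives, false positives) per spreadsheet -- illustrative counts only
sheets = [(98, 2), (190, 10), (57, 43), (294, 6)]

macro = sum(tp / (tp + fp) for tp, fp in sheets) / len(sheets)           # mean of per-sheet precision
micro = sum(tp for tp, _ in sheets) / sum(tp + fp for tp, fp in sheets)  # pooled counts
print(round(macro, 3), round(micro, 3))  # 0.87 0.913
```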

4.3 Term Extraction

So far, around 10,000 terms have been extracted from the place descriptions present in this dataset, 4,939 of them distinct. A few examples containing the POI name, category and related terms are shown in Table 3, ordered by the weight obtained by computing the TF-IDF value of each term. In order to validate the approach, the term extraction process was split into two different evaluations: term coherence, evaluating the extracted terms per POI and verifying whether each term fits the given POI representation; and relevance computing, evaluating whether the list of terms produced for a certain POI is correctly ordered by term relevance. The validation process was again performed with the help of volunteers from different areas (mostly students).

In the term coherence task, volunteers were asked to classify a set of POIs and their extracted terms into 3 categories, in order to infer how coherent the extracted terms were: 1 - less relevant; 2 - somewhat relevant; 3 - much relevant. The validation set was composed of 3,279 POIs with extracted terms, split among multiple spreadsheets so as to be equally distributed among the volunteers. At the end of the validation process, around 27% of the term groups were classified as less relevant, 17% as somewhat relevant and 56% as much relevant.

In the relevance computing task, a binary question was asked of the volunteers: "Is the following list of terms properly ordered by its relevance regarding this place?". The validation dataset was derived from the previous task, comprising 1,759 POIs (POIs with only one extracted term were not considered, nor were those classified as less relevant). This dataset was split into multiple spreadsheets which were delivered to the same volunteers as in the term coherence task; 73% of the answers were "yes" and 27% "no". The results of both tasks were found satisfactory, and once again some patterns were found in the worst results: in the term coherence task, words that do not add value to a place description were frequent, even when they were not stopwords. This issue can be addressed by using a dynamic stopword list, updated whenever terms with low relevance show up.

Table 3. Examples of the extracted terms per POI

Name                                                | Categories                               | Terms
Palavra de Viajante - livros e acessórios de viagem | Turismo de Compras, Livrarias            | viagens, mapas, guias, cidades
AEFCSH - UNL                                        | Organization                             | associação de estudantes, faculdade de ciências
Restaurante Osaka                                   | Noite e Restaurantes, Restaurantes       | sushi, tempura, sashimi
Indie Rock Café                                     | Noite e Restaurantes, Bares e Discotecas | sex pistols, ramones, madness

5 Conclusions and Further Work

In this paper we presented our approach to creating a semantic enrichment platform for the Portuguese language. Key points of the approach were validated through experiments: place integration reached a satisfactory precision of 88 percent, with the weaknesses of the approach identified, allowing us to improve it, while term extraction provided preliminary results which still need to be tuned in order to ensure the extraction of meaningful terms that can accurately define a particular place. The term extraction module nonetheless produced good results, extracting meaningful terms around 70 percent of the time, and the order of relevance was found correct for 73 percent of the validation dataset.

Future work includes improvements to the results of place integration and to the term extraction pipeline. Since we were able to identify the biggest source of false positives in the place integration module, the path is now clear to produce better results. Regarding the term extraction module, some rules on term acceptance need to be revised, since part of the extracted terms did not comply with our expectations. The validation method should also be streamlined in order to reduce the effort required from human volunteers. The platform's integration with TICE.Mobilidade has already started: since every component was built in a modular way, the integration process became much simpler.

References

1. Alves, A., Pereira, F., Rodrigues, F., Oliveirinha, J.: Place in perspective: Extracting online information about points of interest. In: Ambient Intelligence. Volume 6439 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg (2010) 61-72
2. Dearman, D., Truong, K.N.: Identifying the activities supported by locations with community-authored content. In: Proc. of the 12th ACM Int. Conference on Ubiquitous Computing, Ubicomp '10, New York, NY, USA, ACM (2010) 23-32
3. Alazzawi, A.N., Abdelmoty, A.I., Jones, C.B.: What can I do there? Towards the automatic discovery of place-related services and activities. Int. Journal of Geographical Information Science 26(2) (2012) 345-364
4. Cano, E., Burel, G., Dadzie, A.S., Ciravegna, F.: Topica: A tool for visualising emerging semantics of POIs based on social awareness streams. In: 10th Int. Semantic Web Conf. (ISWC 2011) (Demo Track) (2011)
5. Butuc, M.G.: Semantically enriching content using OpenCalais. Interface 9(2) (2009) 77-80
6. Cohen, W., Ravikumar, P., Fienberg, S.: A comparison of string metrics for matching names and records. In: ACM International Conference on Knowledge Discovery and Data Mining (KDD), Workshop on Data Cleaning, Record Linkage, and Object Consolidation, Citeseer (2003) 73-78
7. Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84(406) (1989) 414-420
8. Ferreira, L., Teixeira, A., da Silva Cunha, J.P.: REMMA - Reconhecimento de entidades mencionadas do MedAlert. Linguateca (31 December 2008) 213-229
9. Milidiú, R.L., Santos, C.N., Duarte, J.C.: Phrase chunking using entropy guided transformation learning. In: Proc. of ACL-08: HLT (2008) 647-655
10. Nogueira dos Santos, C., Milidiú, R., Rentería, R.: Portuguese part-of-speech tagging using entropy guided transformation learning. In: Computational Processing of the Portuguese Language. Volume 5190 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg (2008) 143-152
11. Freitas, C., Santos, D.: Blogs, Amazônia e a Floresta Sintá(c)tica: um corpus de um novo gênero? In: Atas do ELC2010 (2012)
12. Rodrigues, R., Gonçalo Oliveira, H., Gomes, P.: Uma abordagem ao Págico baseada no processamento e análise de sintagmas dos tópicos. Linguamática 4(1) (April 2012) 31-39
13. Freitas, C., Rocha, P., Bick, E.: Floresta Sintá(c)tica: Bigger, Thicker and Easier. In: Computational Processing of the Portuguese Language, 8th International Conference, Proceedings (PROPOR 2008). Volume 5190, Springer Verlag (8-10 September 2008) 216-219
14. Gonçalo Oliveira, H., Santos, D., Gomes, P., Seco, N.: PAPEL: a dictionary-based lexical ontology for Portuguese. In: Computational Processing of the Portuguese Language, 8th International Conference, Proceedings (PROPOR 2008). Volume 5190, Springer Verlag (8-10 September 2008) 31-40