Semantic Annotation of Web Resources Using IdentityRank and Wikipedia


Norberto Fernández, José M. Blázquez, Luis Sánchez, and Vicente Luque
Telematic Engineering Department, Carlos III University of Madrid

K.M. Węgrzyn-Wolska and P.S. Szczepaniak (Eds.): Adv. in Intel. Web, ASC 43, springerlink.com © Springer-Verlag Berlin Heidelberg 2007

Summary. In this paper we introduce the IdentityRank algorithm, developed to address the problem of named entity disambiguation. It is used for semantic annotation of Web resources taking Wikipedia as knowledge source.

1 Introduction

In order to make the Semantic Web [1] vision a reality, the semantics of the data need to be described in a computer-understandable manner. This process is known in the literature as semantic annotation. In [2] we introduced a system that exploited user queries to generate annotations and used the information generated and maintained by Wikipedia editors as the knowledge source for the annotation process. As indicated in [2], that system had some limitations. For instance, it is a manual system, so it requires user collaboration to gather metadata. More automated semantic annotation approaches seem more appropriate for annotating high volumes of information, for scalability reasons.

To deal with these limitations, we have extended our initial proposal by including an information extraction tool, ANNIE, in our system. With this tool, we process the textual contents of the Web resources to be annotated, extracting occurrences of named entities: persons, locations and organizations. Once we have these entities, in order to generate semantic annotations, we need to link each entity (e.g. the person Alonso) with its Wikipedia page. As there are usually several instances that can be associated with a certain entity (for instance, for the person Alonso we have, among others, Fernando Alonso, a Formula 1 driver, and José Antonio Alonso, a Spanish Minister), we need to disambiguate the entity by selecting the Wikipedia page that best represents it in the context of the document being annotated. To address this need, we have developed an algorithm for named entity disambiguation based on Google's PageRank [3] that we name IdentityRank (a.k.a. IdRank).

The rest of this paper is organized as follows: Section 2 describes the IdRank algorithm and shows some results of an initial evaluation, Section 3 elaborates on related work and, finally, Section 4 ends the paper with concluding remarks and future lines.

2 IdRank

In this section we describe in detail the IdRank algorithm and the results of an initial evaluation of that algorithm. Due to lack of space we do not describe the PageRank algorithm here. The interested reader can find a comprehensive description of that algorithm in [3].

2.1 IdRank Process

The manual annotation process described in [2] required the user to annotate or disambiguate the terms in his/her query using concepts represented by Wikipedia pages, and to use relevance feedback to indicate that a certain Web resource was relevant for that query. By doing so, a new annotation was generated linking the Web resource with the person who generated the annotation and with the concepts in the annotated query.

The system has now been extended so that, when a user provides a manual annotation as described above, an automatic process starts. This process consists of downloading the Web resource to be annotated and automatically extracting the entities mentioned in its contents using ANNIE. For each entity, the entity text and the entity type (person, location, organization) are provided. Additionally, the links to Wikipedia pages in the contents are also extracted, because they can be considered annotations introduced by the page author at authoring time.

Now IdRank can run, using the information already available: the manually generated annotation, the links to Wikipedia pages mentioned by the Web resource, and the entities. The IdRank process consists of the following steps:

Candidate finding.
The system finds the URLs of the Wikipedia pages that are candidates to represent each of the input entities. To do so, the system uses the Yahoo APIs to query Yahoo with a site restriction to wikipedia.org, once for each distinct entity (distinct pair of entity text and entity type) that needs to be disambiguated. The resulting set of Wikipedia URLs is extended with the Wikipedia URLs extracted from the Web resource content and the ones used in the manual annotation. For URLs obtained from queries, we also store the position of each URL in the original Yahoo result set for later use.

Duplicate removal. The algorithm processes the Wikipedia URL set to filter duplicates. One of the difficulties of this filtering process is that several Wikipedia URLs can represent the same concept (pages in different languages, redirections). Because of this, the filtering process requires downloading and processing the candidate Wikipedia pages, extracting the language links from those pages and detecting HTTP redirections when downloading a page. Once we know the different Wikipedia URLs that can represent the same concept, we assign a unique identifier (a URI) to the concept and store the mapping between that URI and the original Wikipedia URLs. So, given the original Wikipedia URL set, we obtain a set of unique URIs in which each URI, that is, each concept, appears only once. In this page-processing step we also extract the links between Wikipedia pages, which are used in the next step.

Ranking computation. A semantic network is built with the URIs that result from the duplicate removal process. In this network, nodes are concepts represented by URIs. There can be two kinds of links between those nodes:

1. A bidirectional anchor link between node u and node v appears if there is an HTML link from any of the Wikipedia pages that represent the concept u to any of the Wikipedia pages that represent the concept v, or vice versa.
2. A bidirectional cooccurrence link between nodes u and v appears if there are earlier manual annotations, defined by this or other users, that use both the concept u and the concept v to annotate the same Web resource (this exploits the information about cooccurrence of concepts in Web resources).

We give weights to these links. The anchor links are handled in the same way as in the original PageRank, that is, each node gives the same weight to all of its forward links. The weight of the cooccurrence links, which are not included in PageRank, is computed using the cooccurrence frequency of the linked concepts.
Mathematically, this can be expressed as:

$$\alpha_{uv} = \frac{f_{uv}}{\sum_{k \in C_v} f_{kv}} \tag{1}$$

where $f_{uv}$ is the cooccurrence frequency of concepts u and v, that is, the number of Web resources annotated with both u and v divided by the number of Web resources annotated with v. $C_v$ is the set of concepts in the semantic network that cooccur with v in the annotations of at least one Web resource other than the one being analyzed.

Apart from link information, the original PageRank algorithm included a vector E used for ranking personalization, giving more weight to certain nodes in the network. In IdRank, the values of this vector are computed taking into account how the concept u has been used in the recent past in the annotations of the same user who is defining the current annotation. In that sense the algorithm learns from past user annotations. In practice, the value of the u component of the vector E, E(u), is directly proportional to the number of times the concept u has been used in the last M annotations performed by the user, where M is a parameter of the algorithm.
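As an illustration of equation (1) and of the personalization vector E, the following minimal Python sketch computes both from past annotations. The in-memory layout (a dictionary from resource identifier to a set of concepts, and a list of concept sets for the user history), the function names, the exclusion of the current resource from all counts, and the normalization of E so that it sums to one are assumptions made for this example; the paper only states that E(u) is proportional to the usage count.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_weights(annotations, current_resource=None):
    """Return {(u, v): alpha_uv} following equation (1).

    `annotations` maps a resource id to the set of concepts annotating it
    (hypothetical layout). The resource being analyzed is skipped (assumption).
    """
    used = Counter()      # number of resources annotated with v
    together = Counter()  # number of resources annotated with both u and v
    for resource, concepts in annotations.items():
        if resource == current_resource:
            continue
        used.update(concepts)
        together.update(permutations(sorted(concepts), 2))

    # f_uv = (# resources with u and v) / (# resources with v)
    f = {(u, v): n / used[v] for (u, v), n in together.items()}

    # Denominator of equation (1): sum of f_kv over all k that cooccur with v.
    totals = defaultdict(float)
    for (k, v), value in f.items():
        totals[v] += value
    return {(u, v): value / totals[v] for (u, v), value in f.items()}


def personalization(user_history, m=None):
    """Return E(u), proportional to the use of u in the last m annotations.

    `user_history` is a list of concept sets, oldest first; m=None uses all.
    Normalizing E to sum to one is an assumption of this sketch.
    """
    if m is None:
        recent = user_history
    else:
        recent = user_history[max(len(user_history) - m, 0):]
    counts = Counter(c for annotation in recent for c in annotation)
    total = sum(counts.values())
    return {u: n / total for u, n in counts.items()} if total else {}


if __name__ == "__main__":
    past = {"r1": {"A", "B"}, "r2": {"A", "B", "C"}, "r3": {"A", "C"}}
    print(cooccurrence_weights(past))
    print(personalization([{"A", "B"}, {"A"}, {"C"}], m=2))
```

Because every $f_{kv}$ shares the same denominator, the weights $\alpha_{uv}$ for a fixed v sum to one, which mirrors the way a PageRank node splits its weight among its forward links.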

Taking into account all these contributions, we obtain the following equation, an adaptation of the original PageRank equation in [3]:

$$R(u) = k_A \left[ \sum_{v \in S} \frac{1}{N_v} \left( \beta_1 + \beta_2 \alpha_{uv} \right) R(v) \right] + k_E E(u, M) \tag{2}$$

where R(u) is the ranking of the node u, S is the set of nodes in the semantic network, $\beta_1 = 1/2$ if there is an anchor link between v and u or 0 otherwise, $\beta_2 = 1/2$ if there is a cooccurrence link between u and v ($u \neq v$) or 0 otherwise, and $N_v$ is the number of anchor links of v. In order to control the influence of each component of the algorithm on the final results, we use two constants $k_A$ and $k_E$ such that $k_A + k_E = 1$.

We solve this set of equations for each value of u using appropriate numerical methods, such as the one described in [3], obtaining as a result a weight for each of the candidate concepts in the semantic network. Then we translate the URIs of the concepts back to Wikipedia page URLs using the table generated in the duplicate removal step. Each URL associated with a certain URI is assigned the same weight that the algorithm gives to the URI. For each of the original entities, the algorithm assigns as Wikipedia representation the candidate whose URL has the highest weight. If an entity has more than one candidate with maximal weight, the algorithm uses the original Yahoo ranking to decide.

2.2 Evaluation

We have carried out a basic experiment to test the behavior of the IdRank algorithm. In that experiment we use a corpus of ten documents obtained by querying a repository of news items for Alonso and randomly selecting some of the documents. The entities in these documents were automatically detected but, in order to keep errors in the entity extraction process from adding noise to the evaluation of the disambiguation algorithm, the entities were reviewed by two human users. In the end we obtained 118 entities, 65 of them unique.
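To make the ranking step of Section 2.1 concrete, the following Python sketch iterates equation (2) on a toy network and then picks a candidate for an entity. It reads equation (2) as R(u) = k_A Σ_v (1/N_v)(β1 + β2 α_uv) R(v) + k_E E(u). The function names, the data layout, the fixed number of iterations, the per-iteration normalization of R, and the treatment of nodes without anchor links (N_v taken as 1 to avoid a division by zero) are assumptions of this sketch, not details given in the paper.

```python
def idrank(nodes, anchor_links, alpha, E, k_a=0.7, k_e=0.3, iterations=50):
    """Fixed-point iteration of equation (2); returns {node: weight}.

    nodes: list of concept URIs; anchor_links: iterable of (u, v) pairs;
    alpha: {(u, v): weight} from equation (1); E: {node: personalization}.
    """
    neighbours = {u: set() for u in nodes}
    for u, v in anchor_links:  # anchor links are bidirectional
        neighbours[u].add(v)
        neighbours[v].add(u)

    rank = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iterations):
        new_rank = {}
        for u in nodes:
            total = 0.0
            for v in nodes:
                if v == u:
                    continue
                b1 = 0.5 if v in neighbours[u] else 0.0  # anchor link present
                b2 = 0.5 if (u, v) in alpha else 0.0     # cooccurrence link present
                if b1 == 0.0 and b2 == 0.0:
                    continue
                n_v = max(len(neighbours[v]), 1)  # assumption: N_v = 0 treated as 1
                total += (b1 + b2 * alpha.get((u, v), 0.0)) * rank[v] / n_v
            new_rank[u] = k_a * total + k_e * E.get(u, 0.0)
        norm = sum(new_rank.values())
        # Keep the ranks summing to one, as PageRank does (assumption here).
        rank = {u: r / norm for u, r in new_rank.items()} if norm else new_rank
    return rank


def pick_candidate(candidates, rank):
    """Choose the highest-weighted candidate; ties go to the earlier
    position in the original search-engine result list."""
    best = max(enumerate(candidates), key=lambda p: (rank.get(p[1], 0.0), -p[0]))
    return best[1]


if __name__ == "__main__":
    nodes = ["Fernando_Alonso", "Jose_Antonio_Alonso", "Formula_One"]
    links = [("Fernando_Alonso", "Formula_One")]
    ranks = idrank(nodes, links, alpha={}, E={})
    print(pick_candidate(["Jose_Antonio_Alonso", "Fernando_Alonso"], ranks))
```

In this toy network only Fernando_Alonso shares an anchor link with another concept found in the document, so it ends up with a higher weight than the other candidate even though that candidate appears first in the hypothetical result list.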
For each entity, we searched for the entity text in Yahoo with a literal query (in quotes) and the restriction site:wikipedia.org in order to find its candidates. We limited the number of results returned by the search engine to ten and filtered special Wikipedia pages (such as user pages and talk pages) from the result set. Additionally, we manually reviewed the candidate information in order to check whether ten results per entity were enough for the process, and found that in only 7 cases (4 different entities) the result set contained no Wikipedia page that could be used to represent the real meaning of the entity.

We compared our algorithm with two naive algorithms. The first simply assigns to each entity the first result obtained from Yahoo when looking for the entity text in Wikipedia. The second computes the Levenshtein distance between the entity text and the Wikipedia page title using the SimMetrics library and assigns to each entity the Wikipedia page whose title is most similar to the entity text. Additionally, we tested our algorithm in two modes: working with user history (past annotations) and without user history (that is, using only the information on anchor links in the disambiguation process). We built the history by randomly selecting two entities in each document and manually annotating them. The parameters of the algorithm were $k_A = 0.7$, $k_E = 0.3$ and $M = \infty$, that is, we use all the annotations in the history as context for the disambiguation process.

We ran the different algorithms over the corpus and then manually checked the correctness of the entity/Wikipedia page assignments. The results of these experiments are shown in Table 1, which gives the number of correct entity/page assignments for each algorithm. First is the first-result algorithm, Sim is the text similarity algorithm, Links is IdRank using only the anchor link information, and All is IdRank taking all the information into account.

Table 1. Evaluation results

First    Sim    Links    All

3 Related Work

There are several approaches in the state of the art dealing with named entity disambiguation. These approaches can be characterized according to a number of criteria. One of these criteria is the context used to disambiguate the entity. Some approaches use the complete document as context [5]. Others use a number of words before and after the entity [10, 9]. Although some approaches use both common words and named entities as context [10], others suggest that better results can be obtained using only other named entities as context [9]. Another criterion is the use of knowledge sources such as lexical databases, ontologies, etc. There are approaches that make use of such knowledge sources [4, 8] and approaches that try to cluster the named entities without any reference to an available list of possible instances [10, 9].
Finally, we can further classify the approaches with respect to the disambiguation algorithms used: statistical procedures [7, 10, 9], morphosyntactic analysis [9, 5], algorithms exploiting the information and structure provided by an ontology [8], etc. The use of a semantic network ranking algorithm that also takes into account the temporal component of users' interests is the main difference between our approach and those in the state of the art.

4 Conclusions and Future Lines

In this paper we introduced the IdRank algorithm, developed to address the problem of named entity disambiguation. It is used for semantic annotation of Web resources taking Wikipedia as knowledge source. Though some initial evaluation results are reported, more intensive tests need to be run, for instance in order to measure the influence of the parameters of the algorithm on the final results.

Acknowledgements

This work has been partially funded by the Spanish Ministry of Education and Science under contract ITACA (TSI C02-01).

References

1. Berners-Lee T, Hendler J, Lassila O (2001) The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American, May 2001.
2. Fernández N, Blázquez JM, Sánchez L, Luque V (2006) Exploiting Wikipedia in Integrating Semantic Annotation with Information Retrieval. In 4th Atlantic Web Intelligence Conference, AWIC. Israel, June 2006.
3. Page L, Brin S, Motwani R, Winograd T (1999) The PageRank Citation Ranking: Bringing Order to the Web. Stanford Technical Report.
4. Aswani N, Bontcheva K, Cunningham H (2006) Mining Information for Instance Unification. In 5th International Semantic Web Conference. Ed. Springer, LNCS 4273. USA, November 2006.
5. Bagga A, Baldwin B (1998) Entity-Based Cross-Document Coreferencing Using the Vector Space Model. In 17th International Conference on Computational Linguistics. Canada, August 1998.
6. Ginter F, Boberg J, Järvinen J, Salakoski T (2004) New Techniques for Disambiguation in Natural Language and their Applications to Biological Text. Journal of Machine Learning Research, 5.
7. Han H, Giles L, Zha H, Li C, Tsioutsiouliklis K (2004) Two Supervised Learning Approaches for Name Disambiguation in Author Citations. In Joint ACM/IEEE Conference on Digital Libraries. USA, June 2004.
8. Hassell J, Aleman-Meza B, Arpinar IB (2006) Ontology-Driven Automatic Entity Disambiguation in Unstructured Text. In 5th International Semantic Web Conference. Ed. Springer, LNCS 4273. USA, November 2006.
9. Mann GS, Yarowsky D (2003) Unsupervised Personal Name Disambiguation. In 7th Conference on Natural Language Learning. Canada, June 2003.
10. Pedersen T, Purandare A, Kulkarni A (2005) Name Discrimination by Clustering Similar Contexts. In 6th International Conference on Computational Linguistics and Intelligent Text Processing. Ed. Springer, LNCS. Mexico, February 2005.
