A Semantic Similarity Measure for Linked Data: An Information Content-Based Approach


Rouzbeh Meymandpour*, Joseph G. Davis

School of Information Technologies, The University of Sydney, Sydney, Australia

Abstract. Linked Data allows structured data to be published in a standard way such that datasets from various domains can be interlinked. By leveraging Semantic Web standards and technologies, a growing amount of semantic content has been published on the Web as Linked Open Data (LOD). The LOD cloud has made available a large volume of structured data in a range of domains via liberal licenses. The semantic content of LOD, in conjunction with the advanced searching and querying mechanisms provided by SPARQL, has opened up unprecedented opportunities not only for enhancing existing applications, but also for developing new and innovative intelligent semantic applications. However, query-based information retrieval techniques are inadequate for functionalities such as comparing, prioritizing, and ranking search results, which are fundamental to some of the more innovative applications of Linked Data such as recommendation provision, matchmaking, social network analysis, visualization, and data clustering. This paper addresses this problem by building a systematic and accurate measurement model of semantic similarity between resources. Drawing extensively on a feature-based definition of Linked Data, it proposes an information content-based approach that improves on previous methods, which are restricted to specific application domains and are generally less relevant in the context of Linked Data. It is validated and evaluated for measuring item similarity in recommender systems. The experimental evaluation of the proposed measure shows that it outperforms comparable recommender systems that use conventional similarity measures based on collaborative and content-based filtering.
Keywords: Linked Data, Linked Open Data, Similarity Measures, Semantic Similarity, Information Content, Ranking, Recommender Systems, Collaborative Filtering, Content-Based Filtering.

1. Introduction

The rapid development of Semantic Web technologies such as the Resource Description Framework (RDF) [1] has enabled the publication of structured data in a standard way that can be readily consumed and reused by machines and shared across diverse applications. This has transformed the conventional Web of Documents, associated with Web 1.0, into the Web of Data (also referred to as Linked Data), which publishes and interlinks structured data on the Web. Linked Data can be private or public. It can be used inside organizations and enterprises, and shared among business partners to provide easier integration and to facilitate interoperability. Linked Data can also be open. Linked Open Data (LOD) is a recent community-driven effort that provides access to a large and increasing amount of diverse structured data using open Semantic Web standards and through liberal licenses [2]. The LOD cloud provides free access to 570 datasets (as of October 2014) in areas such as media, geography, government, publications and life sciences. Using Semantic Web standards and LOD protocols (see Berners-Lee [3]), these datasets are publicly available for machine and human consumption. This not only offers unprecedented opportunities for developing novel and innovative applications, but also makes application development more efficient and cost-effective.

* Corresponding author. rouzbeh.meymandpour@sydney.edu.au

In order for Semantic Web-based applications to be able to systematically search, retrieve and analyze Linked Data, specific tools and technologies are required. Semantic Web crawlers and search engines are useful tools for browsing and searching semantic data (e.g. the Swoogle search engine [4] and the Semantic Web Search Engine [SWSE] [5]). In addition, SPARQL [6] enables querying the Semantic Web and Linked Data. However, it is unable to deal with issues such as prioritizing and ranking search results. A limitation associated with query-based information retrieval techniques is that they cannot answer questions regarding which of the retrieved results better match the reference query. This is fundamental to some of the interesting applications of Linked Data such as recommendation provision, matchmaking, social network analysis, visualization, semantic navigation, and data clustering. These require specific measures to analyze and compare entities in Linked Data. This paper addresses this problem based on a systematic assessment of semantic similarity between entities. Similarity measures evaluate the degree of overlap between entities based on a set of pre-defined factors such as taxonomic relationships, particular characteristics of the entities, or statistical information derived from the underlying knowledge base. They have been proposed and used in diverse areas such as cognitive psychology, computational linguistics, artificial intelligence (AI), and natural language processing (NLP) to assess the similarity (or dissimilarity) between domain concepts or entities. However, besides the fact that each similarity measure is dependent on the implicit or explicit assumptions in its design and formulation, they are largely limited to the specifications and knowledge representation models of particular application domains. These assumptions make them less applicable in the Linked Data context.
This paper first provides an overview of previous approaches to semantic similarity measurement (Section 2) and describes their limitations in the new context of Linked Data (Section 3). Drawing on a formal, mathematical definition of Linked Data, we proceed to present our LOD-based semantic similarity measure (Section 4). In order to validate the proposed measure and to demonstrate its applicability and value, it is applied to developing a LOD-based recommender system (Section 5). We compare the performance of our recommender system with that of conventional and state-of-the-art systems. Finally, we conclude this paper by discussing the limitations of this study (Section 6), providing a review of the related work on LOD-based similarity measurement and recommendation provision (Section 7) and outlining the future research directions that can be built upon this work (Section 8).

2. Semantic Similarity Measurement

As a fundamental basis for theories of perception, behavior, social bonding, learning and judgment, the notion of similarity has been studied extensively for several decades. It has been discussed by famous philosophers such as Plato and Aristotle (the law of similarity in associationism; Aristotle, 350 B.C.E.) and investigated by distinguished psychologists such as Shepard [7], Tversky [8, 9], and Nosofsky [10, 11]. In cognitive psychology, similarity is defined as the degree of resemblance between two perceptual or conceptual objects. People tend to perceive objects that are similar as a group (the law of similarity in Gestalt psychology). Many researchers have endeavored to understand and mirror the way humans judge the similarity of two or more objects. Drawing on extensive studies of the factors related to the human perception of similarity in psychology (e.g. see Goldstone [12] and Decock and Douven [13]), computer scientists have developed systematic methods to evaluate the level of similarity among various objects of interest.
Semantic similarity reflects the relationship between the meanings of two concepts, entities, terms, sentences or documents. It is computed based on a set of factors derived typically from a knowledge representation model. Thesauri, taxonomies, and ontologies are among the main models used for representing domain knowledge. Therefore, similarity measures are generally domain-dependent: depending on the structure of the application context and its knowledge representation model, various measures have been proposed. Semantic similarity measures can be classified into four main categories: 1) distance-based models that are based on the structural representation of the underlying context, 2) feature-based models that define concepts or entities as sets of features, 3) statistical methods that consider statistics derived from the underlying context, and 4) hybrid models that comprise combinations of the three basic categories. The following sections review the main approaches to semantic similarity measurement. The related work section (Section 7) also provides a detailed review of the key methods for similarity measurement as they relate to Linked Data.

2.1. Distance-based Similarity Measures

Distance-based models, also referred to as geometric models or mental distance approaches in psychology [7] and edge-counting or path-based methods in graph-based representations [14], define similarity as a function of the distance between concepts. Distance metrics satisfy the following mathematical properties:

1) Non-negativity: d(a, b) ≥ 0
2) Coincidence axiom: d(a, b) = 0 if and only if a = b
3) Symmetry: d(a, b) = d(b, a)
4) Triangle inequality: d(a, c) ≤ d(a, b) + d(b, c)

2.1.1. Distance in Multidimensional Spaces

Shepard [7] proposed a geometric model for similarity assessment where objects are represented as points in a multidimensional similarity space (e.g. size, color, and shape are among the dimensions of real-world objects). The more proximate two concepts are in the underlying similarity space, the more similar they are. Given a multidimensional space and points that represent objects of interest, various distance functions such as the Euclidean, Manhattan, and Minkowski distance metrics can be employed to measure the distance between them. When objects in a particular domain of interest are not explicitly represented in a multidimensional space, various techniques can be applied to derive the spaces. For example, in information retrieval (IR), the vector space model (VSM) [15] is a common way of representing documents based on their terms. The frequency of terms in a document (term frequency [TF]) is a simple way to represent each document: each term corresponds to a dimension, and a vector of term frequency values denotes each document.
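The TF representation just described, together with the cosine comparison defined next, can be sketched in a few lines of Python. This is a minimal illustration; the vocabulary, sample documents, and function names are assumptions for the example, not part of the original paper.

```python
from collections import Counter
from math import sqrt

def tf_vector(document, vocabulary):
    """Term-frequency (TF) vector of a document over a fixed vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vocab = ["semantic", "web", "linked", "data"]
d1 = "linked data semantic web linked data"
d2 = "semantic web semantic web"
similarity = cosine(tf_vector(d1, vocab), tf_vector(d2, vocab))
```

Each document becomes a point in a four-dimensional term space, and the angle between the two points serves as their similarity.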
The similarity between two documents can be calculated based on the cosine of the angle between their vectors:

Cosine(A, B) = (A · B) / (‖A‖ ‖B‖) = Σ_{i=1..n} tf(i, A) · tf(i, B) / (√(Σ_{i=1..n} tf(i, A)²) · √(Σ_{i=1..n} tf(i, B)²))   (1)

where A and B are the vectors representing the two documents in an n-dimensional space (n is the number of terms) and tf(i, A) and tf(i, B) are the term frequency values of the term i in the documents. A · B is the dot product (intersection) of the two vectors and ‖A‖ and ‖B‖ are the norms of the vectors. Other related methods for obtaining vector spaces include term frequency-inverse document frequency (TF-IDF), multidimensional scaling (as used by Shepard [7] and Micko [16], among others; also see Nosofsky [10]), latent semantic analysis (LSA) (also known as latent semantic indexing [LSI]) [17, 18], and topic models [19, 20].

2.1.2. Distance in Semantic Networks

According to the semantic network model proposed by Quillian [21], concepts and the relationships among them can be denoted as nodes and links, respectively. In this model, is-a (superordinate and subordinate) relations play a more important role than other types of relations. In taxonomies, where information is structured in a hierarchical manner using is-a relations (see Figure 1), the distance between nodes (the number of edges separating them) can provide an estimate of their mutual similarity. These distance functions are also called edge counting or path-based methods. For example, in Figure 1, the node i is closer to e than to c; thus, i tends to be more similar to e. The logic is that the lower the hierarchical distance, the higher the similarity:

Similarity ∝ 1 / Distance   (2)

Figure 1. Link structure of a sample is-a taxonomy

An extensive body of literature has explored the measurement of semantic similarity using the ontological knowledge of WordNet, a lexical taxonomy of English words widely used in computational linguistics, AI and NLP [22]. It provides definitions as well as several semantic relations between words. It categorizes words into nouns, verbs, adjectives and adverbs, and groups them into synonym sets, called synsets. Synsets are connected using two main types of semantic relations: is-a and part-of (member-of) relations. Is-a relations include hypernym (superclass) and hyponym (subclass) relations. For example, eagle is a hyponym of bird ("eagle is a bird", or "eagle is a subclass of bird") and animal is a hypernym of bird ("animal is a superclass of bird"). Part-of relations include holonym and meronym relations. As an example of the latter, hand is a meronym of (part of, member of) body, and body is a holonym of hand. However, most edge counting methods consider only is-a links, which define the concepts, whereas part-of relations characterize them [14]. A basic technique for measuring the semantic distance between pairs of words represented in an is-a taxonomy was proposed by Rada et al. [14]. It describes the conceptual distance as the shortest path length connecting any two nodes:

RadaDistance(a, b) = δ(a, b)   (3)

where δ(a, b) is the shortest path length between a and b. It is based on the idea that, in is-a hierarchies, terms that are close together are more similar to each other. Other path-based methods incorporate the relative depth of concepts in a given taxonomy into semantic similarity assessment [23, 24, 25]. For example, Wu and Palmer [25] presented the notion of conceptual similarity for verb selection. In this approach, the depth is calculated by counting the edges that separate terms from their least common subsumer (LCS) (also known as the most recent common ancestor [MRCA]), the nearest superclass that both concepts share. For example, in Figure 1, the LCS of nodes i and h, denoted lcs(i, h), is f, while lcs(i, c) = root.
Wu and Palmer's [25] metric relies on the fact that in is-a hierarchies, concepts that are more distant from the root are more specific than those near the root:

WuPalmer(a, b) = 2 · δ(lcs(a, b), ρ) / (δ(a, lcs(a, b)) + δ(b, lcs(a, b)) + 2 · δ(lcs(a, b), ρ))   (4)

where the function δ(·, ·) calculates the number of edges separating two nodes, lcs(a, b) is the nearest common superclass of a and b, and ρ is the root of the taxonomy. For example, in Figure 1, the nodes i and h are more similar to each other (WuPalmer(i, h) = 2/3) than f and g (WuPalmer(f, g) = 1/2). A related similarity metric designed for directed graphs is SimRank, proposed by Jeh and Widom [26]. It has been widely used for finding similar Web pages connected by hyperlinks. In this method, the similarity of two nodes is computed based on the similarity between their neighbors by considering the number of outgoing or incoming links. SimRank [26] and other widely used methods such as PageRank [27], HITS [28], Co-citation [29] and SALSA [30] are designed for link-based graphs such as the Web or citation networks. In these cases, nodes (i.e. Web pages or academic papers) are connected to each other using one type of link (i.e. hyperlinks in the case of the Web and citations in citation networks). Some authors have also attempted to adapt the link-based methods to the WordNet graph. For example, Personalized PageRank [31] has been used for word sense disambiguation (WSD) using WordNet [32, 33]. However, these methods do not explicitly consider the type of the links; all link types have the same weight. Although they can be applied in more complex semantic networks such as Linked Data, where nodes are connected using multiple types of semantic relations, different types of links will be regarded as the same. As a result, the semantics of the specific relations will be partially discounted.
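The path-based logic of Eqs. (3) and (4) can be sketched on a toy is-a hierarchy. The tree below is an illustrative assumption loosely following Figure 1 (only the branches needed for the worked examples are reproduced), and the helper names are ours, not the paper's.

```python
# Toy is-a hierarchy (child -> parent); "root" is the top of the taxonomy.
PARENT = {"a": "root", "b": "root", "c": "a", "f": "b", "g": "b", "h": "f", "i": "f"}

def ancestors(node):
    """Path from a node up to the root, nearest ancestors first."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lcs(a, b):
    """Least common subsumer: the nearest superclass shared by a and b."""
    anc_a = set(ancestors(a))
    for node in ancestors(b):          # walk b upward; the first hit is the LCS
        if node in anc_a:
            return node

def delta(x, anc):
    """Number of edges separating x from its ancestor anc."""
    return ancestors(x).index(anc)

def wu_palmer(a, b):
    """Eq. (4): Wu and Palmer's conceptual similarity."""
    c = lcs(a, b)
    depth = delta(c, "root")           # edges from the LCS down from the root
    return 2 * depth / (delta(a, c) + delta(b, c) + 2 * depth)
```

With this tree, wu_palmer("i", "h") yields 2/3 and wu_palmer("f", "g") yields 1/2, matching the worked values in the text.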
As discussed in the foregoing, the applicability of the presented edge counting-based metrics is restricted to is-a hierarchies or networks with a single link type. Other approaches have attempted to handle the variety in link types, for example, by weighting the links based on their characteristics in order to reflect their relative importance in similarity evaluation [34, 35, 36]. However, they are largely limited to the relations in the WordNet lexical database.

2.2. Feature-Based Similarity Measures

Tversky [9] discussed the limitations of distance-based methods and conducted empirical studies from a psychological point of view. One of the limitations is the symmetry assumption: if a is similar to b, then b is also similar to a. In multidimensional similarity spaces and hierarchical structures, the distance between points or nodes is always the same regardless of the point from which the measurement starts. Tversky [9] argued that psychological concepts and, therefore, human similarity judgments are not always symmetrical; the direction of the similarity statements is essential. For example, we usually say "the son resembles the father" instead of "the father resembles the son" [9:328]. Based on this, Tversky [9] proposed a feature-based model of similarity as the solution. As introduced by Tversky [9], feature-based methods assume that concepts can be represented as sets of features. They assess the similarity of concepts based on the commonalities among their feature sets: any increase in common features among concepts results in a higher similarity score, and any decrease in shared features results in lower levels of similarity. Based on this, set-based indices such as the Jaccard [37] and Dice [38] coefficients can be adopted for similarity assessment. For example, the Jaccard index of two sets is the ratio of shared features to all features:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|   (5)

such that A and B denote the sets of features corresponding to the concepts a and b. In addition to common features, the Tversky [9] ratio model, which is a generalization of the Jaccard and Dice models, also considers the distinctive characteristics of each concept (the features of one concept which are not part of the other):

Tversky(A, B) = |A ∩ B| / (|A ∩ B| + α|A − B| + β|B − A|), for α, β > 0   (6)

where α and β represent the relative contributions of the unique features of A and B to the similarity value, respectively. For example, for α > β, the distinctive features of A are weighted higher than those of B. The α and β parameters can be used to reflect the symmetric or asymmetric nature of a given context: if α = β, then Tversky(A, B) = Tversky(B, A) and the similarity comparison is symmetric; otherwise, it is asymmetric (Tversky(A, B) ≠ Tversky(B, A)). Feature-based models are applicable in contexts in which entities are or can be represented as sets of features. In other situations, such as hierarchical structures of information or ontologies, features need to be explicitly defined for domain concepts or entities. For example, when working with taxonomies, some authors define the features of an entity as the set of its super- or sub-classes and employ the Jaccard or Tversky indices to determine the overall similarity value [39, 40, 41].
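Because Eqs. (5) and (6) operate on plain sets, they translate directly into code. The feature sets below are hypothetical, chosen only to show the symmetric and asymmetric cases.

```python
def jaccard(a, b):
    """Eq. (5): shared features over the union of all features."""
    return len(a & b) / len(a | b)

def tversky(a, b, alpha, beta):
    """Eq. (6): common features weighed against each set's distinctive features."""
    common = len(a & b)
    return common / (common + alpha * len(a - b) + beta * len(b - a))

# Hypothetical feature sets for two concepts.
A = {"wings", "beak", "feathers", "flies"}
B = {"wings", "flies", "stinger"}

sym = tversky(A, B, 1, 1)          # alpha = beta: symmetric, equals Jaccard
asym_ab = tversky(A, B, 0.8, 0.2)  # alpha > beta: A's unique features weigh more
asym_ba = tversky(B, A, 0.8, 0.2)  # swapping the arguments changes the score
```

With α = β = 1 the Tversky index reduces to the Jaccard index; unequal weights make the comparison direction-dependent, mirroring Tversky's asymmetry argument.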
The simplicity and flexibility of feature-based methods enable them to be easily combined with other approaches. In Section 2.4 below, we review a number of hybrid approaches that merge the benefits of feature-based models with other methods.

2.3. Statistical Similarity Measures

Statistical similarity measures incorporate statistics derived from various aspects of the underlying domain into the similarity computation. Several approaches use the frequency of terms in a document as a measure of their informativeness, also known as information content (IC) (see Section 4), and use that as a basis for measuring similarity [42, 43, 44]. For example, Resnik [42] considers the popularity of the LCS of two terms as a measure of their similarity: two terms that share an LCS that is more popular in a corpus (3) are considered less similar than two terms that share a less frequent LCS. It is based on the assumption that terms that are more frequently used (such as I, me, the, etc.) are more general and less informative than less common words. These methods tend to show better results compared to feature-based and edge counting-based semantic similarity measures [42, 43, 44]; they showed a higher level of correlation with human judgments. As described, a large proportion of the approaches proposed for measuring semantic similarity have used the WordNet lexical database as the main knowledge base. However, there are several problems associated with using WordNet, such as its limited applicability and the lack of technical, domain-specific terms. To overcome these issues, many authors have employed large text corpora for measuring semantic similarity. For example, when working with a text corpus, pointwise mutual information (PMI) [46] can be applied for measuring the semantic similarity between words [47]. PMI is based on the ratio of the number of co-occurrences of two terms together to their individual occurrences in a document.
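The PMI ratio just described can be sketched over a toy corpus. The documents below are hypothetical, and the base-2 logarithm is our assumption (any base preserves the ordering).

```python
from math import log2

# Toy corpus of documents as bags of words -- hypothetical data for illustration.
docs = [
    {"doctor", "nurse", "hospital"},
    {"doctor", "hospital", "treatment"},
    {"nurse", "treatment"},
    {"car", "engine"},
]

def p(term):
    """Fraction of documents containing the term."""
    return sum(term in d for d in docs) / len(docs)

def p_joint(t1, t2):
    """Fraction of documents containing both terms."""
    return sum(t1 in d and t2 in d for d in docs) / len(docs)

def pmi(t1, t2):
    """Pointwise mutual information: log of observed co-occurrence over chance."""
    joint = p_joint(t1, t2)
    if joint == 0:
        return float("-inf")           # the terms never co-occur
    return log2(joint / (p(t1) * p(t2)))
```

Terms that co-occur more often than their individual frequencies predict (doctor/hospital) score above zero; unrelated terms (doctor/car) score low or negatively infinite.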
For example, words such as doctors, dentists, nurses, treating, and hospitals are highly associated because they often appear together in the same document [47]. Although PMI can be applied using a document or a set of documents, several approaches (e.g. PMI-IR [48], Etzioni et al. [49], SOC-PMI [50], and Newman et al. [51]) have used online sources such as the search results of Google and Wikipedia as the main corpus. Vector-based methods such as latent semantic analysis (LSA) [17] and explicit semantic analysis (ESA) [52] can also be classified as statistical semantic similarity measures. LSA applies singular value decomposition (SVD) to term-document matrices where each cell contains the frequency of the corresponding word in a document. ESA represents words as weighted vectors of concepts derived from Wikipedia articles. Compared to LSA, ESA showed a higher correlation with human judgment for estimating the relatedness of words [52].

(3) Resnik [42] calculated the frequency of concepts using the Brown Corpus of American English (also referred to as the Brown Corpus), a large collection of text compiled from works published in the United States in 1961, ranging from news articles to science fiction [45].

2.4. Hybrid Similarity Measures

This section gives an overview of a number of approaches that can be classified as hybrid methods: they are based on combinations of the three main methods. Hu et al. [53] combine feature-based methods with distance functions by representing the features of entities in ontologies using description logic and measuring the similarity using a vector-based cosine similarity measure. A number of approaches combine feature-based and statistical methods. For example, in is-a taxonomies, intrinsic information content (IIC) [54] incorporates the number of subclasses of a concept for estimating the information content: the higher the number of subclasses of a term, the lower its informativeness. IIC has also been combined with feature-based [39, 40, 55] and edge counting [56] methods. For example, instead of WordNet, WikiRelate! [56] applies IIC on the category hierarchy of Wikipedia for estimating the semantic relatedness of a pair of words. Milne [57] used the Wikipedia link structure to create a vector model for computing relatedness. WikiWalk [58] is another Wikipedia-based relatedness measure that employs a combination of Personalized PageRank [31] and ESA [59]. It showed better results compared to WordNet-based and other Wikipedia-based methods such as the Wikipedia link measure (WLM) [60] and WikiRelate! [56]. In another study [61], the authors presented an approach for computing semantic relatedness using multilingual semantic graphs created by integrating concepts from WordNet and Wikipedia.

3. Limitations of Previous Approaches for Semantic Similarity Measurement on Linked Data

In the previous section, we reviewed some of the existing similarity measures proposed in the literature in various domains. We studied three main categories of approaches, namely, distance-based metrics, feature-based models and statistical methods, as well as hybrid approaches. However, owing to the existence of heterogeneous relationships between resources and the unique graph structure of Linked Data, our contention is that the existing similarity measures and metrics developed primarily for taxonomies such as WordNet are not the most suitable measures in this new context. The Linked Data graph is a complex semantic network in which information resources (nodes) are connected by a wide range of semantic relations (edges). Unlike WordNet, Linked Data has a wide range of relations, of which is-a and part-of are two particular types. Therefore, any measure of semantic similarity for Linked Data has to consider its particular characteristics, such as the variety in link types and the direction of the relations. Distance-based metrics deal only with is-a relations, while Linked Data is characterized by many different kinds of links, of which the is-a relation (expressed by the rdf:type (4) and rdfs:subClassOf (5) properties) is only one type; such metrics are therefore unable to describe the resources adequately. Moreover, approaches that consider various types of links [34, 35, 36] are limited to the relations in WordNet. In addition, although methods such as SimRank [26] and PageRank [27] can be applied based on the link structure of Linked Data, they do not consider the semantics represented using various types of relations, and all link types have the same weight in the similarity measurement.

A wide range of approaches based on WordNet, such as the metrics proposed by Wu and Palmer [25] and Leacock [23, 24], and the statistical methods proposed by Resnik [42], Jiang and Conrath [43], and Lin [44], determine the semantic similarity based on the least common subsumer (LCS). Their applicability to the Linked Data graph structure (see Figure 2 (b) and (c)) is limited. In the tree-based structure of lexical taxonomies such as WordNet, concepts or entities are connected in a hierarchical manner (Figure 2 (a)), while the is-a link structure of Linked Data is different: resources can be subsumed by multiple classes. Therefore, a resource can have multiple parents (see Figure 2 (b) and refer to Section for an example). Moreover, in the Linked Data graph, resources are linked via multiple incoming and outgoing edges (Figure 2 (c)). This graph structure makes the LCS less relevant for the Linked Data context.

(4) rdf: is the prefix for the RDF namespace. (5) rdfs: is the prefix for the RDF Schema namespace.

Figure 2. Difference between (a) a hierarchical taxonomy, (b) the is-a link structure of Linked Data and (c) a sample Linked Data graph

Finally, the Linked Data graph is a complex network in which information about resources is not explicitly represented as sets of features. Therefore, despite their simplicity and flexibility, feature-based measures such as Tversky [9], Jaccard [37], and Dice [38] cannot be readily adopted. In the next section, we propose a feature-based definition of Linked Data which will be combined with statistical models to develop a semantic similarity measure that considers the particular characteristics of Linked Data.

4. A Hybrid Semantic Similarity Measure for Linked Data

Similarity measurement is fundamental to a wide range of Linked Data applications, including entity comparison and ranking, ranking of search results, recommender systems, and data clustering and visualization. Having reviewed previous approaches to similarity measurement and discussed their limitations in the context of Linked Data, this section presents a feature-based definition of Linked Data that considers its specific characteristics and proposes a semantic similarity measure which is a hybrid of feature-based and statistical approaches.

4.1. Formal Definition of Linked Data

Tim Berners-Lee, inventor of the Semantic Web and the World Wide Web (WWW), used the term Giant Global Graph (GGG) on his blog in 2007 to refer to the new environment enabled by Semantic Web technologies [62]. Similar to the social graph of social networks, where people are connected based on their relationships and interests, in the GGG information resources are linked based on the semantic relations among them. Linked Data is a massive collection of RDF statements related to various entities of interest such as movies, artists, actors, cities, etc., known as information resources, or resources for short.
Each RDF statement (also known as a triple) is in the form of subject-predicate-object. In RDF, subjects, predicates and objects are uniquely identifiable using URIs (uniform resource identifiers). RDF statements can be represented as nodes and edges, where the subject and the object are the nodes and the relations (predicates) between them are the edges connecting the nodes. The edges are directed, meaning that the direction of the links is part of the definition of the relation. Moreover, considering that a subject can be connected to several objects to express various statements, and that the object of one statement can also be the subject of another statement, the graph representation of the RDF statements describing Linked Data forms a massive graph of interconnected nodes, referred to as the Giant Global Graph. Thereby, we describe Linked Data as a graph of resources and the relations among them:

Definition 1. (Linked Data): Linked Data (LD) is a labelled directed graph, defined as ⟨R, L, T⟩, such that R = {r1, r2, …, r|R|} is a set of resources (nodes, vertices), L = {l1, l2, …, l|L|} is a set of links (edges, relations, predicates) and T = {t1, t2, …, t|T|} is a set of triples (statements) of the form ⟨r1, l1, r2⟩, where l1 ∈ L is a link from r1 ∈ R to r2 ∈ R.

Based on this definition, resources can be defined according to their neighbors, that is, their relationships with other resources in the Linked Data graph. We define a resource in Linked Data as the set of its features: the statements in which the resource participates as the subject or the object:

Definition 2. (Features in Linked Data): A feature f of the resource r ∈ R in Linked Data (LD) is denoted as a triple ⟨l, rt, D⟩, where rt ∈ R is the (target) resource directly connected to r via the link l ∈ L and D is the direction of the link (In/Out).

Hence, we define resources based on the notion of features in Linked Data:

Definition 3. (A Resource in Linked Data): A resource r ∈ R in Linked Data (LD) is denoted as the set of its features Fr, defined as follows:

Fr = FrOut ∪ FrIn   (7)

FrOut = {⟨li, ri, Out⟩ | li ∈ L, ri ∈ R, ⟨r, li, ri⟩ ∈ LD}   (8)

FrIn = {⟨li, ri, In⟩ | li ∈ L, ri ∈ R, ⟨ri, li, r⟩ ∈ LD}   (9)

In this definition, the incoming and outgoing relations of the resource, the type of the relation, the direction of the relation and the target node (the node connected to the other end of the relation) are considered in the description of the resource. As a simple illustration, the features of the nodes r and s in Figure 3 below are the sets Fr and Fs, respectively:

Fr = {⟨l1, a, Out⟩, ⟨l2, b, In⟩, ⟨l3, c, Out⟩, ⟨l4, d, Out⟩}
Fs = {⟨l2, b, In⟩, ⟨l4, c, Out⟩, ⟨l4, d, Out⟩, ⟨l5, e, Out⟩}

Also, Fr ∩ Fs = {⟨l2, b, In⟩, ⟨l4, d, Out⟩}.

In this section, we have presented a mathematical definition of Linked Data and of resources in Linked Data. In this definition, resources are defined based on their features, drawn from their incoming and outgoing relations. This definition provides us with a simple yet flexible basis for developing Linked Data-based measures in the following sections.

Figure 3. An example of resources and features in the Linked Data graph (r, s, a, b, …, e are sample resources and l1, l2, …, l5 are sample links)

4.2. A Hybrid Approach for Semantic Similarity Measurement

As discussed earlier, Tversky [9] characterized concepts as representable as sets of features, and the commonalities between the features of two concepts can be used as a measure of their similarity (see Section 2.2).
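Definitions 2 and 3 can be made concrete with a short sketch. The triple list below reproduces the Figure 3 example from the text (resource names and link labels as given there); the function name is ours.

```python
# Triples (subject, link, object) reproducing the Figure 3 example.
triples = [
    ("r", "l1", "a"), ("b", "l2", "r"), ("r", "l3", "c"), ("r", "l4", "d"),
    ("b", "l2", "s"), ("s", "l4", "c"), ("s", "l4", "d"), ("s", "l5", "e"),
]

def features(resource, triples):
    """Feature set of a resource per Definitions 2-3: (link, target, direction)."""
    out = {(l, o, "Out") for s, l, o in triples if s == resource}
    inc = {(l, s, "In") for s, l, o in triples if o == resource}
    return out | inc

F_r = features("r", triples)
F_s = features("s", triples)
shared = F_r & F_s   # the common features of r and s
```

Plain set intersection recovers exactly the shared features given in the text, ⟨l2, b, In⟩ and ⟨l4, d, Out⟩, which is what makes the Tversky-style set indices directly applicable in the next section.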
However, a main drawback of feature-based methods is that they treat all features alike: every feature receives the same weight in the similarity evaluation. Several empirical studies have shown that the factors influencing human similarity judgments have varying levels of importance [63, 64, 65]. These studies showed that the level of importance varies according to the psychological stimuli of the comparison-maker and contextual variables. Statistical models of similarity incorporate statistics about the underlying context into the semantic similarity comparison in order to reflect the relative importance of the influencing factors. Our approach is based on the information content (IC) of features, that is, their relative informativeness. The importance of the factors influencing the similarity judgment (i.e. features) is therefore derived from their informativeness, that is, the amount of information conveyed by their presence.

Information Content Measurement in Linked Data

Information theory, as proposed by Shannon [66], describes the mathematical foundations of communication: transmitting information over communication channels by means of coding schemes [67, 68]. Building on earlier work by Hartley [69], Shannon's key idea was to define information as a measurable mathematical quantity, information content (IC). Shannon [66] presents information content as a measure of the information conveyed by the occurrence of a random event chosen from a set of possible events. IC is defined as the logarithm of the inverse of the event's probability:

IC(x) = log(1/π(x)) = −log(π(x)) (10)

IC(x) is the amount of information produced by the occurrence of the event x, based on its probability π(x): the higher the probability of an event, the lower its information content.
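Equation (10) is a one-liner in code; the choice of logarithm base only fixes the unit of measurement (this is a sketch for illustration, not part of the paper's implementation):

```python
import math

# Equation (10): IC(x) = -log(pi(x)); the log base determines the unit.
def ic_bits(p):
    return -math.log2(p)    # base 2: bits

def ic_nats(p):
    return -math.log(p)     # base e: nats

def ic_bans(p):
    return -math.log10(p)   # base 10: bans (hartleys)

p = 0.25  # an event with probability 1/4
print(ic_bits(p))  # 2.0 (bits)

# Base conversion: log_b(a) = log_2(a) / log_2(b)
assert math.isclose(ic_bans(p), ic_bits(p) / math.log2(10))
```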
The logic generalizes to various domains: common symbols in a message, frequent messages in a collection of possible messages, or frequent terms in a textual document carry less information than less frequent ones.

In other words, the occurrence of less frequent events conveys more information; therefore, they are more informative. The logarithm in Equation (10) is usually taken to base two, in which case information is measured in units called bits. Other bases can also be used: for base 10 and for natural logarithms, the units of information are called bans (decimal digits, or hartleys) and nats (natural units), respectively. Bases can easily be converted to each other (log_b(a) = log_2(a) / log_2(b)); however, base two is the most common case.

The concept of information content has been widely used in a number of areas. For example, in data compression, the more frequent terms in a corpus are considered to be less informative; therefore, they can be stored using fewer bits. Similarly, in variable-length source coding, symbols in the source message that are more common are sent using fewer bits and those that are less frequent are transmitted using more bits. In other domains, the probability of events may not be measured based on their frequency. For example, in hierarchical taxonomies of nouns (such as WordNet), the terms with a higher number of subclasses (children) are considered to be less informative [54].

In the next sections, we extend the notion of information content to Linked Data. We first propose a measure, derived from the formal definition of Linked Data and the principles of information content measurement, to assess the value of information associated with features in Linked Data. Based on this, we proceed to define the aggregate information content of resources according to their sets of features.
The measure of the information content of resources will be used as a basis for semantic similarity measurement using Linked Data.

Information Content of Features in Linked Data

Based on Definition 3, a resource in Linked Data can be described using its set of features, that is, by having its characteristics defined as a collection of its incoming and outgoing relations. As explained previously, the type of the relation, the target node and the direction of the relation are considered in the definition of features (Definition 2). Based on the probability theory foundations of information content, we define the IC value associated with a feature in Linked Data as follows:

Definition 4. (The Information Content of Features in Linked Data): Let π(f) be the probability of the feature f in Linked Data (LD). The information content of f is defined as:

IC(f) = log(1/π(f)) = −log(π(f)) (11)

The probability of a feature can be computed based on its relative frequency: the ratio of the number of resources with the feature to the total number of resources:

π(f_i) = φ(f_i) / N (12)

where φ(f_i) is the frequency of the feature f_i and N is the total number of resources. Similar to Shannon's [66] and Hartley's [69] logarithmic measures of information content (Equation (10)), the proposed measure of the information content of features in Linked Data (computed using Equations (11) and (12)) satisfies the following properties (see Mézard and Montanari [70]). For a feature f:

1) IC(f) ≥ 0.

2) IC(f) = 0 if and only if the feature f is certain, that is, its probability is one (π(f) = 1). In other words, for a feature f that is shared by all resources in the underlying Linked Data, the amount of information conveyed by its occurrence is zero.

3) IC(f) is maximum when the feature f occurs only once. Thus, for a feature f where φ(f) = 1 and π(f) = 1/N, IC(f) is maximum and equal to log(N). Hence, the information content of a feature is always less than or equal to log(N).
4) Additivity: for any two mutually independent features in Linked Data (i.e. the occurrence of one feature does not depend on the occurrence of the other), the information content associated with the occurrence of both features is equal to the sum of the IC values of the two features:

IC(f_1, f_2) = −log(π(f_1, f_2)) = −log(π(f_1) · π(f_2)) = −log(π(f_1)) − log(π(f_2)) = IC(f_1) + IC(f_2) (13)
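Equations (11)-(13) and properties 2-4 can be checked directly on a toy dataset; the feature frequencies below are hypothetical, chosen only to make the arithmetic visible:

```python
import math

# Hypothetical feature frequencies in a toy dataset of N = 8 resources.
N = 8
freq = {
    ("rdf:type", "owl:Thing", "Out"): 8,  # shared by every resource
    ("l2", "b", "In"): 2,
    ("l4", "d", "Out"): 1,                # occurs only once
}

def ic(feature):
    """IC(f) = -log2(phi(f) / N), Equations (11) and (12)."""
    return -math.log2(freq[feature] / N)

# Property 2: a certain feature (pi = 1) carries zero information.
print(ic(("rdf:type", "owl:Thing", "Out")))  # 0.0

# Property 3: a unique feature reaches the maximum, log2(N) = 3 bits.
print(ic(("l4", "d", "Out")))  # 3.0

# Property 4 (additivity): for independent features, the IC of the joint
# occurrence equals the sum of the individual IC values.
f1, f2 = ("l2", "b", "In"), ("l4", "d", "Out")
joint_prob = (freq[f1] / N) * (freq[f2] / N)  # independence assumption
assert math.isclose(-math.log2(joint_prob), ic(f1) + ic(f2))
```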

These mathematical properties of the proposed information content measure of features in Linked Data can be used to further study and extend the measure. As an example of the second property, all resources in DBpedia are an owl:Thing, which is expressed using the feature ⟨rdf:type, owl:Thing, Out⟩.6 Therefore, its IC value is zero; no information is produced by its occurrence. The fourth property can be extended to a set of features in order to compute the IC value of resources in Linked Data. In the next section, we employ this property and introduce the partitioned information content (PIC) of resources in Linked Data.

Partitioned Information Content of Resources in Linked Data

The proposed logarithmic computation of the information content of features (Equations (11) and (12)) has the additivity property, which implies that the information content of two independent features is equal to the sum of their IC values. This section extends this property to develop a measure of the information content of resources in Linked Data:

Definition 5. (The Probability of a Set of Features in Linked Data): Let π(f) be the probability of the feature f in Linked Data (LD). For the resource r ∈ R, represented as a set of (mutually independent) features F_r = {f_1, f_2, ..., f_|F_r|}, the probability of the set F_r is defined as

π(F_r) = π(f_1) · π(f_2) ⋯ π(f_|F_r|) = ∏_{f_i ∈ F_r} π(f_i) (14)

Having defined the information content of a feature (Equation (11)) and the probability of a set of features (Equation (14)), we can measure the information content of a resource r based on the set of its features F_r:

−log(π(F_r)) = −log(∏_{f_i ∈ F_r} π(f_i)) = −∑_{f_i ∈ F_r} log(π(f_i)) (15)

We express this measure as the partitioned information content (PIC) of a resource in Linked Data [71]:

Definition 6. (Partitioned Information Content in Linked Data): The information content of a resource in Linked Data is defined as the sum of the information content values of its features:

PIC(r) = ∑_{f_i ∈ F_r} IC(f_i) (16)

In this definition, IC(f_i) is computed by Equations (11) and (12). The PIC measure can be summarized as follows:

PIC(r) = −∑_{f_i ∈ F_r} log(φ(f_i) / N) (17)

The partitioned information content (PIC) of resources in Linked Data is the aggregate amount of information content conveyed by a given resource (PIC(r) ≥ 0) and is based on the information content of the resource's features. Owing to the use of base two for the logarithm function, PIC is measured in units of information, that is, bits. The characteristics of PIC are derived from its information theory fundamentals. Equations (11) and (12) are premised on the notion that highly probable features are general and less informative, while distinctive features, that is, features with a low number of occurrences, are more specific and convey more information. For example, based on the frequency of features, the fact that all actors are a Person (specified using the feature ⟨rdf:type, foaf:Person, Out⟩) is substantially more popular than the fact that a particular actor starred in a movie (specified using the feature ⟨starring, movieURI, In⟩). The former applies to millions of resources in DBpedia that describe a person, while the latter is only used when representing the actors of the movie (specified with movieURI). The frequency of the latter is equal to the number of actors who starred in the movie; therefore, it is more informative than the former. The information content of features determines their contribution to the partitioned information content of the resource to which they belong. The PIC measure, computed using Equation (17), implies that popular features, shared by a large number of resources in Linked Data (e.g. being a person), contribute less to the PIC of resources than infrequent ones (e.g. the actors of a particular movie). As a result, resources with more distinctive features are more informative.

6 owl: is the prefix for
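Equations (16)-(17) reduce PIC to a sum over a resource's feature set. A minimal sketch of the actor example, with assumed (hypothetical) feature frequencies:

```python
import math

N = 1000  # assumed total number of resources in a toy dataset

# Hypothetical frequencies: one generic feature vs. two distinctive ones.
freq = {
    ("rdf:type", "foaf:Person", "Out"): 1000,  # every resource is a Person
    ("starring", "movieA", "In"): 10,
    ("starring", "movieB", "In"): 4,
}

def pic(features):
    """PIC(r) = -sum(log2(phi(f)/N)) over the feature set, Equation (17)."""
    return -sum(math.log2(freq[f] / N) for f in features)

actor = {("rdf:type", "foaf:Person", "Out"),
         ("starring", "movieA", "In"),
         ("starring", "movieB", "In")}

# The generic rdf:type feature contributes 0 bits; the two rare "starring"
# features dominate the resource's information content.
print(round(pic(actor), 3))
```

As expected, the certain feature adds nothing, while each rare feature contributes log2(N/φ(f)) bits.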

Partitioned Information Content across Datasets on the LOD Cloud

The Linked Open Data (LOD) cloud provides free access to more than 570 datasets in various areas such as media, geography, publications, life sciences and government. However, entities are often described in multiple datasets. For example, LinkedMDB (the Linked Movie Database) is a part of the LOD cloud that provides structured information on over 85,600 movies [72]. These movies are also described in other LOD datasets such as DBpedia and Freebase. Another example is GeoNames, through which detailed semantics on geographic information are provided. These datasets are connected using owl:sameAs relations. The giant graph formed by linking these datasets is referred to as the Linked Open Data cloud.

In order to leverage the potential power of Linked Open Data, we extend PIC to include semantics about resources from various datasets. As resources are often described in multiple datasets in the LOD cloud, valuable information can be obtained from a variety of sources. Therefore, the partitioned information content (PIC) of a resource is the aggregation of its PIC values over all datasets in the whole LOD cloud:

PIC_LOD(r) = ∑_{LD ∈ LOD} PIC_LD(r) (18)

where PIC_LD(r) is computed separately for each dataset in LOD using Equation (17). The datasets considered in the LOD cloud-based PIC computation can be datasets such as DBpedia, Freebase, LinkedMDB, MusicBrainz, etc., or datasets that provide semantics in multiple languages. For example, localized editions of DBpedia 3.8 are published in 111 languages.
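Equation (18) is a plain sum of per-dataset PIC values. A sketch with assumed scores (the numbers are hypothetical; in practice each value comes from Equation (17) applied to one dataset):

```python
# Hypothetical per-dataset PIC values (in bits) for one resource.
pic_per_dataset = {
    "DBpedia": 120.5,
    "LinkedMDB": 34.2,
    "Freebase": 58.1,
}

# Equation (18): PIC_LOD(r) is the sum over all datasets describing r.
pic_lod = sum(pic_per_dataset.values())
print(round(pic_lod, 1))  # 212.8 (bits)
```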
These datasets can be included to compensate for missing information or to add information on entities that are described better in languages other than English.

PICSS: Partitioned Information Content (PIC)-Based Semantic Similarity Measure

We propose our partitioned information content (PIC)-based semantic similarity measure, called PICSS, which is a combination of feature-based and information content-based approaches. This measure not only takes into account all types of relations but also adjusts the influence of features on the similarity value based on their informativeness. Given the notion of the information content of features in Linked Data presented earlier and the partitioned information content (PIC) proposed in Section 4.2.3, we employ the Tversky [9] ratio model and propose PICSS, a PIC-based semantic similarity measure for Linked Data:

Definition 7. (PICSS, a PIC-based Semantic Similarity Measure for Linked Data): The similarity of two resources r, s ∈ R, represented as the sets of their features F_r and F_s, respectively, is defined as:

PICSS(r, s) = PIC(F_r ∩ F_s) / (PIC(F_r ∩ F_s) + PIC(F_r − F_s) + PIC(F_s − F_r)) (19)

The similarity scores computed by PICSS are normalized between zero and one, where a score of zero represents no similarity between resources (perfectly dissimilar resources) and one represents perfect similarity (identical resources).11 Based on the information theoretic foundations of the measure, as less frequent features are considered to be more informative, the fact that they are shared between the resources is more influential than would be the case for frequent, less informative features. An important characteristic of PICSS is that the similarity value increases with more shared features and decreases with differences between resources.
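Putting Equations (17) and (19) together on the Figure 3 toy example, with assumed (hypothetical) feature frequencies, PICSS can be sketched as:

```python
import math

N = 100  # assumed total number of resources in the toy dataset

# Hypothetical frequencies for the features of r and s from Figure 3.
freq = {
    ("l1", "a", "Out"): 5, ("l2", "b", "In"): 20, ("l3", "c", "Out"): 2,
    ("l4", "d", "Out"): 4, ("l4", "c", "Out"): 8, ("l5", "e", "Out"): 10,
}

def pic(features):
    # Equation (17): sum of -log2(phi(f)/N) over the feature set
    return -sum(math.log2(freq[f] / N) for f in features)

F_r = {("l1", "a", "Out"), ("l2", "b", "In"),
       ("l3", "c", "Out"), ("l4", "d", "Out")}
F_s = {("l2", "b", "In"), ("l4", "c", "Out"),
       ("l4", "d", "Out"), ("l5", "e", "Out")}

def picss(fr, fs):
    # Equation (19): Tversky ratio model weighted by information content
    common = pic(fr & fs)
    return common / (common + pic(fr - fs) + pic(fs - fr))

score = picss(F_r, F_s)
assert 0.0 <= score <= 1.0
assert math.isclose(picss(F_r, F_r), 1.0)  # identical resources
print(round(score, 3))
```

Note how the rarer shared feature ⟨l_4, d, Out⟩ contributes more bits to the numerator than the common ⟨l_2, b, In⟩, which is exactly the informativeness weighting PICSS adds over a plain Tversky ratio.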
Tversky [9:330] used block letters to illustrate the importance of considering differences as well as commonalities in similarity assessment. Assume that each block letter can be represented as a set of straight lines: for example, the only feature of 'I' is one vertical line, while the features of 'E' are one vertical and three horizontal lines. Under this assumption, 'I' is more similar to 'F' than to 'E': despite the fact that the same feature (one vertical line) is shared between 'I' and both of the others, as 'I' and 'F' have fewer distinctive features (three), they are considered to be more similar (in contrast to the four distinctive features of 'I' and 'E'). However, if distinctive features were not considered in the similarity measure, 'I' would be equally similar to both 'F' and 'E'. Therefore, based on Tversky's theory of similarity, any increase in commonalities and/or decrease in differences between entities leads to a higher similarity. In order to compute the similarity accurately, PICSS considers both the shared features and the distinctive features (i.e. the features of one resource which are not part of the other) of resources in the similarity computation. PICSS combines the advantages of feature-based and information content-based measures. It enables applications to perform in-depth semantic analysis of entities based on structured data acquired from Linked Open Data. In the following sections, we explain how PIC and PICSS can be implemented using SPARQL queries and present examples demonstrating their performance.

Implementation

In order to compute our proposed semantic similarity measure, PICSS, a number of SPARQL queries need to be executed to measure the partitioned information content (PIC) of resources using Linked Data. To begin the analysis of resources in a particular domain of interest, we need to retrieve all resources of a certain type, using a SPARQL query such as the one shown in Listing 1.

SELECT ?resource
WHERE {
  ?resource rdf:type ?resourceType .
}

Listing 1. A SPARQL query to retrieve instances (resources) of a certain type (?resourceType has to be replaced with a particular type from the DBpedia ontology depending on the domain, e.g. dbo:Film,12 dbo:MusicalArtist, etc.)

In order to calculate PIC (Equation (17)), we first need to compute the total number of resources (N). It is calculated depending on the ontological structure of the underlying Linked Data. For example, in DBpedia, all resources are a Thing, expressed using the feature ⟨rdf:type, owl:Thing, Out⟩. The total number of resources can be counted using the SPARQL query shown in Listing 2 below. In our experimental dataset (see Section 5.2), it was equal to 2,350,906.

SELECT (COUNT(?resource) AS ?N)
WHERE {
  ?resource rdf:type owl:Thing .
}

Listing 2. A SPARQL query to retrieve the total number of resources in DBpedia

We also need to extract the features of resources. Based on the definition of features in Linked Data (refer to Definition 2, Section 4.1), features can be retrieved using two simple SPARQL queries. Listing 3 shows an example of retrieving the outgoing relations of a given resource. A similar query can be executed to extract the incoming relations. Next, the information content of each feature needs to be computed based on its frequency. The SPARQL query presented in Listing 4 is used to retrieve the frequency of an outgoing feature. Finally, by aggregating the IC values of all features of a given resource, its PIC is computed. The same calculations and queries can be applied to compute the PIC of the shared or distinctive features of two resources when computing our semantic similarity measure, PICSS (Equation (19)).

SELECT DISTINCT ?linkType ?targetResource ("Out" AS ?linkDirection)
WHERE {
  <resource> ?linkType ?targetResource .
  FILTER (!isLiteral(?targetResource))
}

Listing 3. A SPARQL query to retrieve the outgoing edges of a resource

SELECT (COUNT(?resource) AS ?freq)
WHERE {
  ?resource ?linkType ?targetResource .
  FILTER (!isLiteral(?targetResource))
}

Listing 4. A SPARQL query to retrieve the frequency of an outgoing feature

For better performance, these queries can be combined and executed in parallel.

Sample Output

This section presents the results of our exploratory analysis of applying PICSS in a number of domains, namely, Films; Music, which is a collection of musical

11 In this model, we assume that the similarity is symmetric.
12 dbo: is the prefix for
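The per-feature workflow (Listing 4 plus Equation (17)) can be sketched offline. Here, `fetch_frequency` and `FREQ_TABLE` are hypothetical names standing in for executing the Listing 4 query against a SPARQL endpoint:

```python
import math

def outgoing_frequency_query(link_type: str, target: str) -> str:
    """Build a Listing 4-style query for one concrete outgoing feature."""
    return (
        "SELECT (COUNT(?resource) AS ?freq) WHERE { "
        f"?resource <{link_type}> <{target}> . }}"
    )

# Stand-in for running the query against an endpoint: we look the
# frequency up in a hypothetical, pre-fetched table instead.
FREQ_TABLE = {("dbo:starring", "db:SomeMovie"): 12,
              ("rdf:type", "owl:Thing"): 2350906}
N = 2350906  # total number of resources (the Listing 2 count reported above)

def fetch_frequency(link_type, target):
    return FREQ_TABLE[(link_type, target)]

def pic(features):
    """Equation (17): aggregate -log2(freq/N) over a resource's features."""
    return -sum(math.log2(fetch_frequency(l, t) / N) for l, t in features)

# A feature shared by every resource contributes 0 bits.
assert pic([("rdf:type", "owl:Thing")]) == 0.0
print(round(pic([("dbo:starring", "db:SomeMovie")]), 2))
```

In a real deployment the lookup would be replaced by query execution against the endpoint, with the per-feature queries batched or issued in parallel, as noted above.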


More information

Linked Data. Department of Software Enginnering Faculty of Information Technology Czech Technical University in Prague Ivo Lašek, 2011

Linked Data. Department of Software Enginnering Faculty of Information Technology Czech Technical University in Prague Ivo Lašek, 2011 Linked Data Department of Software Enginnering Faculty of Information Technology Czech Technical University in Prague Ivo Lašek, 2011 Semantic Web, MI-SWE, 11/2011, Lecture 9 Evropský sociální fond Praha

More information

Discovering Semantic Similarity between Words Using Web Document and Context Aware Semantic Association Ranking

Discovering Semantic Similarity between Words Using Web Document and Context Aware Semantic Association Ranking Discovering Semantic Similarity between Words Using Web Document and Context Aware Semantic Association Ranking P.Ilakiya Abstract The growth of information in the web is too large, so search engine come

More information

A Linguistic Approach for Semantic Web Service Discovery

A Linguistic Approach for Semantic Web Service Discovery A Linguistic Approach for Semantic Web Service Discovery Jordy Sangers 307370js jordysangers@hotmail.com Bachelor Thesis Economics and Informatics Erasmus School of Economics Erasmus University Rotterdam

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Relevance Feedback. Query Expansion Instructor: Rada Mihalcea Intelligent Information Retrieval 1. Relevance feedback - Direct feedback - Pseudo feedback 2. Query expansion

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

<is web> Information Systems & Semantic Web University of Koblenz Landau, Germany

<is web> Information Systems & Semantic Web University of Koblenz Landau, Germany Information Systems & University of Koblenz Landau, Germany Semantic Search examples: Swoogle and Watson Steffen Staad credit: Tim Finin (swoogle), Mathieu d Aquin (watson) and their groups 2009-07-17

More information

DBpedia-An Advancement Towards Content Extraction From Wikipedia

DBpedia-An Advancement Towards Content Extraction From Wikipedia DBpedia-An Advancement Towards Content Extraction From Wikipedia Neha Jain Government Degree College R.S Pura, Jammu, J&K Abstract: DBpedia is the research product of the efforts made towards extracting

More information

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER CS6007-INFORMATION RETRIEVAL Regulation 2013 Academic Year 2018

More information

WordNet-based User Profiles for Semantic Personalization

WordNet-based User Profiles for Semantic Personalization PIA 2005 Workshop on New Technologies for Personalized Information Access WordNet-based User Profiles for Semantic Personalization Giovanni Semeraro, Marco Degemmis, Pasquale Lops, Ignazio Palmisano LACAM

More information

An Improving for Ranking Ontologies Based on the Structure and Semantics

An Improving for Ranking Ontologies Based on the Structure and Semantics An Improving for Ranking Ontologies Based on the Structure and Semantics S.Anusuya, K.Muthukumaran K.S.R College of Engineering Abstract Ontology specifies the concepts of a domain and their semantic relationships.

More information

Citation for published version (APA): He, J. (2011). Exploring topic structure: Coherence, diversity and relatedness

Citation for published version (APA): He, J. (2011). Exploring topic structure: Coherence, diversity and relatedness UvA-DARE (Digital Academic Repository) Exploring topic structure: Coherence, diversity and relatedness He, J. Link to publication Citation for published version (APA): He, J. (211). Exploring topic structure:

More information

A Semantic Web-Based Approach for Harvesting Multilingual Textual. definitions from Wikipedia to support ICD-11 revision

A Semantic Web-Based Approach for Harvesting Multilingual Textual. definitions from Wikipedia to support ICD-11 revision A Semantic Web-Based Approach for Harvesting Multilingual Textual Definitions from Wikipedia to Support ICD-11 Revision Guoqian Jiang 1,* Harold R. Solbrig 1 and Christopher G. Chute 1 1 Department of

More information

Putting ontologies to work in NLP

Putting ontologies to work in NLP Putting ontologies to work in NLP The lemon model and its future John P. McCrae National University of Ireland, Galway Introduction In natural language processing we are doing three main things Understanding

More information

A service based on Linked Data to classify Web resources using a Knowledge Organisation System

A service based on Linked Data to classify Web resources using a Knowledge Organisation System A service based on Linked Data to classify Web resources using a Knowledge Organisation System A proof of concept in the Open Educational Resources domain Abstract One of the reasons why Web resources

More information

Semantically Driven Snippet Selection for Supporting Focused Web Searches

Semantically Driven Snippet Selection for Supporting Focused Web Searches Semantically Driven Snippet Selection for Supporting Focused Web Searches IRAKLIS VARLAMIS Harokopio University of Athens Department of Informatics and Telematics, 89, Harokopou Street, 176 71, Athens,

More information

OWLIM Reasoning over FactForge

OWLIM Reasoning over FactForge OWLIM Reasoning over FactForge Barry Bishop, Atanas Kiryakov, Zdravko Tashev, Mariana Damova, Kiril Simov Ontotext AD, 135 Tsarigradsko Chaussee, Sofia 1784, Bulgaria Abstract. In this paper we present

More information

Papers for comprehensive viva-voce

Papers for comprehensive viva-voce Papers for comprehensive viva-voce Priya Radhakrishnan Advisor : Dr. Vasudeva Varma Search and Information Extraction Lab, International Institute of Information Technology, Gachibowli, Hyderabad, India

More information

Getting to Know Your Data

Getting to Know Your Data Chapter 2 Getting to Know Your Data 2.1 Exercises 1. Give three additional commonly used statistical measures (i.e., not illustrated in this chapter) for the characterization of data dispersion, and discuss

More information

Text Mining. Munawar, PhD. Text Mining - Munawar, PhD

Text Mining. Munawar, PhD. Text Mining - Munawar, PhD 10 Text Mining Munawar, PhD Definition Text mining also is known as Text Data Mining (TDM) and Knowledge Discovery in Textual Database (KDT).[1] A process of identifying novel information from a collection

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Effective Latent Space Graph-based Re-ranking Model with Global Consistency

Effective Latent Space Graph-based Re-ranking Model with Global Consistency Effective Latent Space Graph-based Re-ranking Model with Global Consistency Feb. 12, 2009 1 Outline Introduction Related work Methodology Graph-based re-ranking model Learning a latent space graph A case

More information

Bruno Martins. 1 st Semester 2012/2013

Bruno Martins. 1 st Semester 2012/2013 Link Analysis Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 4

More information

Feature selection. LING 572 Fei Xia

Feature selection. LING 572 Fei Xia Feature selection LING 572 Fei Xia 1 Creating attribute-value table x 1 x 2 f 1 f 2 f K y Choose features: Define feature templates Instantiate the feature templates Dimensionality reduction: feature selection

More information

Tree Models of Similarity and Association. Clustering and Classification Lecture 5

Tree Models of Similarity and Association. Clustering and Classification Lecture 5 Tree Models of Similarity and Association Clustering and Lecture 5 Today s Class Tree models. Hierarchical clustering methods. Fun with ultrametrics. 2 Preliminaries Today s lecture is based on the monograph

More information

DBPedia (dbpedia.org)

DBPedia (dbpedia.org) Matt Harbers Databases and the Web April 22 nd, 2011 DBPedia (dbpedia.org) What is it? DBpedia is a community whose goal is to provide a web based open source data set of RDF triples based on Wikipedia

More information

Linked Data and RDF. COMP60421 Sean Bechhofer

Linked Data and RDF. COMP60421 Sean Bechhofer Linked Data and RDF COMP60421 Sean Bechhofer sean.bechhofer@manchester.ac.uk Building a Semantic Web Annotation Associating metadata with resources Integration Integrating information sources Inference

More information

Linked Open Data: a short introduction

Linked Open Data: a short introduction International Workshop Linked Open Data & the Jewish Cultural Heritage Rome, 20 th January 2015 Linked Open Data: a short introduction Oreste Signore (W3C Italy) Slides at: http://www.w3c.it/talks/2015/lodjch/

More information

Semantic Similarity Measures in MeSH Ontology and their application to Information Retrieval on Medline. Angelos Hliaoutakis

Semantic Similarity Measures in MeSH Ontology and their application to Information Retrieval on Medline. Angelos Hliaoutakis Semantic Similarity Measures in MeSH Ontology and their application to Information Retrieval on Medline Angelos Hliaoutakis November 1, 2005 Contents List of Tables List of Figures Abstract Acknowledgements

More information

Semantics and Ontologies for Geospatial Information. Dr Kristin Stock

Semantics and Ontologies for Geospatial Information. Dr Kristin Stock Semantics and Ontologies for Geospatial Information Dr Kristin Stock Introduction The study of semantics addresses the issue of what data means, including: 1. The meaning and nature of basic geospatial

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

Week 7 Picturing Network. Vahe and Bethany

Week 7 Picturing Network. Vahe and Bethany Week 7 Picturing Network Vahe and Bethany Freeman (2005) - Graphic Techniques for Exploring Social Network Data The two main goals of analyzing social network data are identification of cohesive groups

More information

Nearest Neighbor Search by Branch and Bound

Nearest Neighbor Search by Branch and Bound Nearest Neighbor Search by Branch and Bound Algorithmic Problems Around the Web #2 Yury Lifshits http://yury.name CalTech, Fall 07, CS101.2, http://yury.name/algoweb.html 1 / 30 Outline 1 Short Intro to

More information

A Study of Future Internet Applications based on Semantic Web Technology Configuration Model

A Study of Future Internet Applications based on Semantic Web Technology Configuration Model Indian Journal of Science and Technology, Vol 8(20), DOI:10.17485/ijst/2015/v8i20/79311, August 2015 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 A Study of Future Internet Applications based on

More information

Automatically Annotating Text with Linked Open Data

Automatically Annotating Text with Linked Open Data Automatically Annotating Text with Linked Open Data Delia Rusu, Blaž Fortuna, Dunja Mladenić Jožef Stefan Institute Motivation: Annotating Text with LOD Open Cyc DBpedia WordNet Overview Related work Algorithms

More information

Where Should the Bugs Be Fixed?

Where Should the Bugs Be Fixed? Where Should the Bugs Be Fixed? More Accurate Information Retrieval-Based Bug Localization Based on Bug Reports Presented by: Chandani Shrestha For CS 6704 class About the Paper and the Authors Publication

More information

DBpedia Extracting structured data from Wikipedia

DBpedia Extracting structured data from Wikipedia DBpedia Extracting structured data from Wikipedia Anja Jentzsch, Freie Universität Berlin Köln. 24. November 2009 DBpedia DBpedia is a community effort to extract structured information from Wikipedia

More information

Enhanced Image Retrieval using Distributed Contrast Model

Enhanced Image Retrieval using Distributed Contrast Model Enhanced Image Retrieval using Distributed Contrast Model Mohammed. A. Otair Faculty of Computer Sciences & Informatics Amman Arab University Amman, Jordan Abstract Recent researches about image retrieval

More information

Clustering. Bruno Martins. 1 st Semester 2012/2013

Clustering. Bruno Martins. 1 st Semester 2012/2013 Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts

More information

A Distributional Approach for Terminological Semantic Search on the Linked Data Web

A Distributional Approach for Terminological Semantic Search on the Linked Data Web A Distributional Approach for Terminological Semantic Search on the Linked Data Web André Freitas Digital Enterprise Research Institute (DERI) National University of Ireland, Galway andre.freitas@deri.org

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Big Mathematical Ideas and Understandings

Big Mathematical Ideas and Understandings Big Mathematical Ideas and Understandings A Big Idea is a statement of an idea that is central to the learning of mathematics, one that links numerous mathematical understandings into a coherent whole.

More information

Improving Difficult Queries by Leveraging Clusters in Term Graph

Improving Difficult Queries by Leveraging Clusters in Term Graph Improving Difficult Queries by Leveraging Clusters in Term Graph Rajul Anand and Alexander Kotov Department of Computer Science, Wayne State University, Detroit MI 48226, USA {rajulanand,kotov}@wayne.edu

More information

A Novel PAT-Tree Approach to Chinese Document Clustering

A Novel PAT-Tree Approach to Chinese Document Clustering A Novel PAT-Tree Approach to Chinese Document Clustering Kenny Kwok, Michael R. Lyu, Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

Fast Contextual Preference Scoring of Database Tuples

Fast Contextual Preference Scoring of Database Tuples Fast Contextual Preference Scoring of Database Tuples Kostas Stefanidis Department of Computer Science, University of Ioannina, Greece Joint work with Evaggelia Pitoura http://dmod.cs.uoi.gr 2 Motivation

More information

Online Social Networks and Media

Online Social Networks and Media Online Social Networks and Media Absorbing Random Walks Link Prediction Why does the Power Method work? If a matrix R is real and symmetric, it has real eigenvalues and eigenvectors: λ, w, λ 2, w 2,, (λ

More information

Query Difficulty Prediction for Contextual Image Retrieval

Query Difficulty Prediction for Contextual Image Retrieval Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.

More information

Falcon-AO: Aligning Ontologies with Falcon

Falcon-AO: Aligning Ontologies with Falcon Falcon-AO: Aligning Ontologies with Falcon Ningsheng Jian, Wei Hu, Gong Cheng, Yuzhong Qu Department of Computer Science and Engineering Southeast University Nanjing 210096, P. R. China {nsjian, whu, gcheng,

More information

June 15, Abstract. 2. Methodology and Considerations. 1. Introduction

June 15, Abstract. 2. Methodology and Considerations. 1. Introduction Organizing Internet Bookmarks using Latent Semantic Analysis and Intelligent Icons Note: This file is a homework produced by two students for UCR CS235, Spring 06. In order to fully appreacate it, it may

More information

ONTOPARK: ONTOLOGY BASED PAGE RANKING FRAMEWORK USING RESOURCE DESCRIPTION FRAMEWORK

ONTOPARK: ONTOLOGY BASED PAGE RANKING FRAMEWORK USING RESOURCE DESCRIPTION FRAMEWORK Journal of Computer Science 10 (9): 1776-1781, 2014 ISSN: 1549-3636 2014 doi:10.3844/jcssp.2014.1776.1781 Published Online 10 (9) 2014 (http://www.thescipub.com/jcs.toc) ONTOPARK: ONTOLOGY BASED PAGE RANKING

More information

Linked Data and RDF. COMP60421 Sean Bechhofer

Linked Data and RDF. COMP60421 Sean Bechhofer Linked Data and RDF COMP60421 Sean Bechhofer sean.bechhofer@manchester.ac.uk Building a Semantic Web Annotation Associating metadata with resources Integration Integrating information sources Inference

More information