Efficient semi-automated assessment of annotations trustworthiness


Ceolin et al. Journal of Trust Management 2014, 1:3
RESEARCH, Open Access

Davide Ceolin*, Archana Nottamkandath* and Wan Fokkink
*Correspondence: d.ceolin@vu.nl; a.nottamkandath@vu.nl
The Network Institute, VU University Amsterdam, de Boelelaan 1081a, 1081HV Amsterdam, The Netherlands

Abstract

Crowdsourcing provides a valuable means for accomplishing large amounts of work which may require a high level of expertise. We present an algorithm for computing the trustworthiness of user-contributed tags of artifacts, based on the reputation of the user, represented as a probability distribution, and on the provenance of the tag. The algorithm requires only a small number of manually assessed tags, and computes two trust values for each tag, based on reputation and provenance. We moreover present a computationally cheaper adaptation of the algorithm, which clusters semantically similar tags in the training set and builds an opinion on a new tag based on its semantic relatedness to the medoids of the clusters. Also, we introduce an adaptation of the algorithm based on the use of provenance stereotypes as an alternative basis for the estimation. Two case studies from the cultural heritage domain show that the algorithms produce satisfactory results.

Keywords: Trust; Annotations; Semantic similarity; Subjective logic; Clustering; Cultural heritage; Tagging; Crowdsourcing; Provenance

Introduction

Through the Web, cultural heritage institutions can reach large masses of people, with intentions varying from increasing visibility (and hence visitors) to acquiring user-generated content. Crowdsourcing is an effective way to handle tasks which are highly demanding in terms of the amount of work needed to complete them and the required level of expertise [1], such as annotating artifacts in large cultural heritage collections. For this reason, many cultural heritage institutions have opened up their archives to ask the masses to help them in tagging or annotating their artifacts. In earlier years it was feasible for employees at the cultural heritage institutions to manually assess the quality of the tags entered by external users, since there were relatively few contributions from Web users. However, with the growth of the Web, the amount of data has become too large to be accurately dealt with by the experts at the disposal of these institutions within a reasonable time. Nevertheless, a high quality of annotations is vital for their business: the cultural heritage institutions need the annotations to be trustworthy in order to maintain their authoritative reputation. This calls for mechanisms to automate the annotation evaluation process in order to assist the cultural heritage institutions in obtaining quality content from the Web. Annotations from external users can be either in the form of tags or free text, describing entities in the crowdsourced systems.

Here, we focus on tags in the cultural heritage domain, which mainly describe the content, context and facts about an artifact by associating words with it.

The goal of the work described in this paper is to automate the process of evaluation of tags obtained through crowdsourcing in an effective way, by means of an algorithm. Crowdsourcing provides massive amounts of annotations, but these are not always trustworthy enough, so we aim at automating the process of deciding whether they are satisfactorily correct and of high quality, i.e. at evaluating them. This is done by first collecting manual evaluations of the quality of a small part of the tags contributed by a user, and then learning a statistical model from them. Throughout the paper we refer to this set as the training set. On the basis of such a model, which assumes the existence of a relation between a user's reputation and his overall performance, the system automatically evaluates the tags further added by the same user or user stereotype (i.e. a set of users behaving similarly, e.g., users that always provide their tags on Sunday morning). We refer to this set of new tags as the test set. Suppose that a user, Alex, provides annotations to the Fictitious National Museum. We propose a method that automatically evaluates these annotations, based on a small set of annotations that the museum previously evaluated, from which we derive Alex's reputation, or the reputation of the users whose annotation behaviour is similar to Alex's. We will return to this example in more detail in the following sections.

We employ Semantic Web technologies to represent and store the annotations and the corresponding reviews. We use subjective logic to build a reputation for users that contribute to the system, and moreover semantic similarity measures to generate assessments on the tags entered by the same users at a later point in time. By reputation, we mean a value indicating the estimated probability that the annotations provided by a given author (or, later in the paper, by a given user stereotype) are positively evaluated. In subjective logic, we use the expected value of an opinion about an author as the value of the reputation. The opinion is based on a set of previously evaluated tags.
In order to reduce the computation time, we cluster evaluated tags to reduce the number of comparisons. Our experiments show that this preprocessing does not seriously affect the accuracy of the predictions, while significantly reducing the computation time. The proposed algorithms are evaluated on two datasets from the cultural heritage domain.

These case studies show that it is possible to semi-automatically evaluate the tags entered by users in crowdsourcing systems into binomial categories (good, bad) with an accuracy above 80%.

Apart from using subjective logic and semantic similarity, we also use provenance mechanisms to evaluate the quality of user-contributed tags. Provenance is represented by means of a data record per tag, containing information about its creation, such as the time of day, the day of the week and the typing speed, obtained by tracking user behavior. We use provenance information to group annotations according to the stereotype or behavior (or "provenance group") that produced them. In other words, we group them depending on whether they are produced by, for instance, early-morning or late-night users. Once the annotations have been grouped per stereotype, we compute a reputation for each stereotype, based on a sample of evaluations provided by an authority: we learn the policy adopted by the authority in evaluating the annotations and we apply the learnt model to further annotations. The fact that we take provenance into account and do not rely solely on user identities makes our approach suitable for situations where users are anonymous and only their behavior is tracked. By policy we mean a set of rules that the institution adopts to evaluate tags. The fact that we do not have an explicit definition of such rules (nor could we ask for it) determines the need to learn a probabilistic model that aims at mimicking the institution's evaluation strategy. For instance, we do not know a priori whether one of two conflicting tags about the same image is wrong, both because we do not know the image (the tags could refer to two distinct parts of it) and because we do not know the museum policies (which may prohibit their coexistence, in principle). So we rely only on the museum evaluations and do not consider the possible impact of conflicting tags.

We propose three algorithms for estimating the trustworthiness of tags. The first learns a reputation for each user based on a set of evaluated tags. Then it predicts the evaluations of the rest of the tags provided by the same user, by ranking them according to the user's performance in each specific domain (using semantic similarity measures) and then accepting a number of tags proportional to the user's reputation. The second algorithm reduces the computational complexity by clustering the tags in the training set, and the third algorithm computes the same prediction on the basis of the user stereotype rather than on the basis of each single user identity. The novelty of this research lies in the automation of tag evaluations in crowdsourcing systems by coupling subjective logic opinions with measures of semantic similarity along with provenance metrics. The only variable parameter that we require is the size of the set of manual evaluations that are needed to build a useful and reliable reputation. Moreover, in the experiments that we performed, varying this parameter did not substantially affect the performance (resulting in about 1% precision variation per five new observations considered in a user reputation); we will further investigate the impact of this parameter in the future. Using our algorithms, we show how it is possible to avoid asking the curators responsible for the quality of the annotations to set a threshold in order to make assessments about a tag's trustworthiness (e.g., accept only tags which have a trust value above a given threshold).
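To make the notion of provenance stereotype concrete, the following is a minimal sketch (in Python, which we also use for similarity computations later in the paper) of how annotations could be bucketed by creation time. The bucket boundaries are illustrative assumptions; the actual provenance records described above carry richer features, such as typing speed.

from datetime import datetime

def provenance_stereotype(created: datetime) -> str:
    """Map an annotation's creation time to a coarse behavioral bucket.

    The day-part boundaries below are illustrative assumptions, not the
    paper's exact grouping; richer features (e.g., typing speed) can be
    appended to the key in the same way.
    """
    hour = created.hour
    if 5 <= hour < 12:
        part = "morning"
    elif 12 <= hour < 18:
        part = "afternoon"
    else:
        part = "evening-night"
    return f"{created.strftime('%A')}-{part}"

# e.g., provenance_stereotype(datetime(2014, 3, 2, 9, 30)) -> 'Sunday-morning'

All annotations that map to the same key share one reputation, so evidence can be pooled even when the individual users are anonymous.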
Background and literature review

Trust has been studied extensively in Computer Science. We refer the reader to Sabater and Sierra [2], Gil and Artz [3] and Golbeck [4] for comprehensive reviews of trust in computer science, the Semantic Web, and the Web, respectively.

The work presented in this paper focuses on trust in crowdsourced information from the Web, using an adapted version of the definition of Castelfranchi and Falcone [5], as reported by Sabater and Sierra [2]: we decide to trust or not to trust tags based on a set of beliefs and assumptions about who produced the tags and how they were produced. We quantify these beliefs, for instance, through reputations.

Crowdsourcing techniques are widely used by cultural heritage and multimedia institutions for enhancing the available information about their collections. Examples include the Tag Your Paintings project [6], the Steve.Museum project [7] and the Waisda? video tagging platform [8]. The Socially Enriched Access to Linked Cultural (SEALINC) Media project also investigates in this direction. In this project, the Rijksmuseum [9] in Amsterdam is using crowdsourcing on a Web platform, selecting experts of various domains to enrich information about its collection. One of the case studies analyzed in this paper is provided by the SEALINC Media project.

Trust management in crowdsourced systems often employs wisdom-of-crowds approaches [10]. In our scenarios we cannot make use of those approaches, because the level of expertise needed to annotate cultural heritage artifacts restricts the potential set of users, thus making this kind of approach inapplicable or less effective. Gamification, which consists of using game mechanisms for involving users in non-game tasks, is another approach that leads to an improvement of the quality of tags gathered from crowds, as shown, for instance, in von Ahn et al. [1]. The work presented here is orthogonal to a gamified environment, as it allows us to semi-automatically evaluate the user-contributed annotations and hence to semi-automatically incentivize them. By combining the two, museums could increase user incentivization (showing his reputation may be enough to incentivize a user) while curating the quality of annotations. Users that participated in the experiments that provided the datasets for our analyses did not receive monetary incentives, so leveraging incentives related to gamification and personal satisfaction (by means of reputation tracking) may turn out to be an important factor in increasing the accuracy of the tags collected.

In folksonomy systems such as the Steve.Museum project, tag evaluation techniques such as checking the presence of the tags in standard vocabularies and thesauri, and determining their frequency and their popularity or agreement with other tags (see, for instance, Van Damme et al. [11]), have been employed to determine the quality of tags entered by users. Such mechanisms focus mainly on the contributed content, with little or no reference to the user who authored it. Also, in folksonomy systems the crowd often manages the tags, while in our scenarios the crowd only provides the tags, which are managed by museums or other institutions according to specific policies. Medelyan et al. [12] present algorithms to determine the quality of tags entered by users in a collaboratively created folksonomy, and apply them to the CiteULike dataset [13], which consists of text documents. They evaluate the relevance of user-provided tags by means of text document-based metrics. In our work, since we evaluate tags, we cannot apply document-based metrics, and since we do not have large amounts of tags per subject at our disposal, we cannot check for consistency among users tagging the same image.
Similarly, we cannot compute semantic similarity based on the available annotations (as in Cattuto et al. [14]). In fact, since we have at our disposal neither image analysis software nor explicit museum policies, we cannot know whether possible conflicts between tags regarding the same image are due to the fact that some are correct and some are not, or to the fact that they refer to different aspects (or parts) of a complex picture.

Therefore, instead of assuming one of the two cases a priori, we determine the trustworthiness of the tags on the basis of the reputation of their user or provenance stereotype. As a future direction, we plan to also consider input from image recognition software, which will help us deal with conflicting or dubious tags.

In open collaborative sites such as Wikipedia [15], where information is contributed by Web users, automated quality evaluation mechanisms have been investigated (see, for instance, De La Calzada et al. [16]). Most of these mechanisms involve computing trust from article revision histories and user groups (see Zeng et al. [17] and Wang et al. [18]). These algorithms track the changes that a particular article or piece of text has undergone over time, along with details of the users performing the changes. In our case study, tags do not have a revision history. Another approach to obtaining trustworthy data is to find experts amongst Web users with good intentions (see Demartini et al. [19]). This mechanism assumes that users who are experts tend to provide more trustworthy annotations, and it aims at identifying such experts by analyzing profiles built by tracking user performance. In our model, we build profiles based on user performance in the system. So the profile is only behavior-based, and rather than looking for expert and trustworthy users, we build a model which helps in evaluating tag quality based on the estimated reputation of the tag author. There is, however, a clear relation between highly reputed users and experts, although these two classes do not always overlap.

Modeling of reputation and user behavior on the Web is a widely studied domain. Javanmardi et al. [20] propose three computational models for user reputation, extracting detailed user edit patterns and statistics which are particularly tailored to wikis, while we focus on the annotations domain. Ceolin et al. [21] build a reputation- and provenance-based model for predicting the trustworthiness of Web users in Waisda? over time. We optimize the reputation management and the decision strategies described in that paper. We use subjective logic to represent user reputations, in combination with semantic relatedness measures. This work extends Ceolin et al. [21,22]. Similarity measures have been combined with subjective logic by Tavakolifard et al. [23], who infer new trust connections between entities (e.g., users) given a set of trust connections known a priori. In our paper, we also start from a graphical representation of relations between the various participating entities (annotators, tags, reviewers, etc.), but: (1) trust relationships are learnt from a sample of museum evaluations, and (2) new trust connections are inferred based on the relative position of the tags in another graph, WordNet.

We also use semantic similarity measures to cluster related tags in order to optimize the computations. In Cilibrasi et al. [24], hierarchical clustering is used for grouping related topics, while Ushioda et al. [25] experiment with clustering words in a hierarchical manner. Begelman et al. [26] present an algorithm for the automated clustering of tags on the basis of tag co-occurrences in order to facilitate more effective retrieval. A similar approach is used by Hassan-Montero and Herrero-Solana [27]: they compute tag similarities using the Jaccard similarity coefficient and then cluster the tags hierarchically using the k-means algorithm.
In our work, to build user reputations, we cluster the tags along with their respective evaluations (e.g., accept or reject). Each cluster is represented by a medoid (that is, the element of the cluster which is the closest to its center), and in order to evaluate a newly entered tag by the same user, we consider clusters which are most semantically relevant to the new tag. This helps in selectively weighing only the relevant evidence about a user for evaluating a new tag.
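As an illustration, a medoid can be computed in a few lines of Python. Here sim is assumed to be any pairwise similarity function on tags, such as the WordNet-based measures discussed later in the paper:

def medoid(cluster, sim):
    """Return the element of the cluster with the highest total similarity
    to the other elements, i.e. the element closest to the cluster center."""
    return max(
        cluster,
        key=lambda tag: sum(sim(tag, other) for other in cluster if other != tag),
    )

Comparing a new tag against one medoid per cluster, instead of against every evaluated tag, is what reduces the number of similarity computations.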

In general, different cultural heritage institutions employ different values and metrics, on varying scales, to represent the trustworthiness of user-contributed information. The accuracy of various scales has been studied earlier. Certain cases use a binary (boolean) scale for trust values, as in Golbeck et al. [28], while binomial values (i.e., the probabilities of two mutually exclusive values, zero and one, which we use in our work) are used in Guha et al. [29] and Kamvar et al. [30].

Relevant for the work presented in this paper is the link between provenance and trust. Bizer and Cyganiak [31], Hartig and Zhao [32] and Zaihrayeu et al. [33] use provenance and background information expressed as annotated or named graphs to produce trust values. We use the same class of information to make our estimates, but we do not use named graphs to represent provenance information; we represent provenance by means of the W3C recommendation PROV-O, the PROV Ontology [34,35]. Provenance is employed for determining the trustworthiness of user-contributed information in crowdsourced environments in Ceolin et al. [21], where provenance information is used in combination with user reputation to make binomial assessments of annotations. They employ support vector machines for making the provenance-based estimates, while we employ a subjective logic-based approach. Provenance is used for data verification in crowdsourced environments by Ebden et al. [36]. They introduced provenance tracking into their online CollabMap application (used to crowdsource evacuation maps), and in this way collected approximately 5,000 provenance graphs, generated using the Open Provenance Model [37] (which has since been superseded by the PROV Data Model and Ontology). They have large provenance graphs at their disposal and can learn useful features about artifact trustworthiness from the graph topologies. Here, the graphs at our disposal are much more limited, so we cannot rely on the graph topology, but we can easily group graphs into stereotypes. Provenance mechanisms have also been used to understand and study workflows in collaborative environments, as discussed in Altintas et al. [38]. We share the same context with that work, but we do not focus on the workflow of artifact creation.

The rest of the paper is structured as follows: first we describe the research questions tackled and connect them with the datasets used. Then we explain our research work in detail and present and discuss the obtained results. Lastly, we provide some final conclusions.

Research design and methodology

The goal of this paper is to propose an algorithm for computing the trustworthiness of annotations in a fast and reliable manner. We focus on three main evaluation aspects for our algorithm. First, the trust values produced by our algorithm are meant as indicators of the trustworthiness of annotations, and therefore they must be accurate enough to warrant their usefulness. Accuracy of trust values is achieved by carefully handling the information at our disposal and by utilizing the existence of a relationship between the features considered (e.g., the annotation creator) and the trust values themselves. If the information is handled correctly and the relationship holds, then the trust values are accurate enough and form a basis to automatically decide whether or not to use the annotations.
We evaluate this first research question by applying our algorithm to two different datasets, one from a SEALINC Media project experiment and the other from the Steve.Museum project. In both cases we divide the dataset into two parts, a training set and a test set, so as to build a model based on subjective logic and semantic similarity on the training set, and then evaluate the accuracy of such a model on the test set.

The second evaluation we make concerns the possibility of performing trust estimations in a relatively fast manner, by properly clustering the training set on a semantic similarity basis. Here the goal is to reduce the computational overhead due to avoidable comparisons between evaluated annotations and new annotations. We evaluate this contribution by applying clustering mechanisms to the training set data of the aforementioned datasets and by running our algorithm for computing trust values on the clustered training sets. The evaluation checks whether clustering reduces the computation time (and if it does, by what magnitude) and whether it affects the accuracy of the predictions.

Finally, we show that the algorithm we propose is dependable and not solely dependent on the availability of information about the author of an annotation. Our assumption is that when the identity of the author is not known, or when a reliable reputation about the author is not available, we can base our estimates on provenance information, that is, on a range of information about how the tag was created (e.g., the timestamp of the annotation). By properly gathering and grouping such information, we can use it as a stereotypical description of a user's behavior. Users are often constrained in their behavior by the environment and other factors; for instance, they produce tags within certain periodic intervals, such as particular times of the day or days of the week. By recognizing such stereotypes, we can compute a reputation per stereotype rather than per user. While on the one hand this approach guarantees the availability of evidence, as typically multiple users belong to the same stereotype, on the other hand it compensates for the lack of evidence about specific users. We evaluate our hypothesis over the two datasets mentioned before by splitting them into two parts, one to build a provenance-based model and the other to evaluate it.

Methods

Here we describe the methods adopted and implemented in our algorithm. We start by describing the tools that were already available and that we incorporated into our framework, and then we continue with the framework description.

Preliminaries

The system that we propose aims at estimating the trustworthiness of new annotations based on a set of evaluated ones (per user or per provenance stereotype, as we will see). In order to make the estimates, we need a probabilistic logic that allows us to model and reason about the evidence at our disposal while accounting for the uncertainty due to the limited information available. For this reason we employ subjective logic, which fits our needs. Moreover, since the evidence at our disposal consists of textual annotations, we use semantic similarity measures to understand the relevance of each piece of evidence when analyzing each different annotation.

Subjective logic

Subjective logic is a probabilistic logic that we use extensively in our system in order to reason about the trustworthiness of the annotations and the reputations of their authors, based on limited samples.

[Figure 1: Beta distribution of the user trustworthiness. Axes: trust value (x) versus probability density (y). We use a Beta distribution to describe the probability for each value in the [0...1] interval to be the right reputation for a given user or the right trust value for a given tag.]

In subjective logic, so-called subjective opinions (represented as ω) express the belief that a source x owns with respect to the value of an assertion y (for instance, a user's reputation). When y can assume only two values (e.g., true or false), the opinion is called binomial; when y ranges over more than two values, the opinion is called multinomial. Opinions are computed as follows, where the positive and negative evidence are represented as p and n respectively, and b, d, u and a represent the belief, disbelief, uncertainty and prior probability, respectively. A binomial opinion is represented as:

\omega_y^x = (b, d, u)    (1)

where:

b = \frac{p}{p+n+2}, \qquad d = \frac{n}{p+n+2}, \qquad u = \frac{2}{p+n+2}, \qquad a = \frac{1}{2}    (2)

Such an opinion is equivalent to a Beta probability distribution (see Figure 1), which describes the likelihood for each possible trust value to be the right trust value for a given subject. The expected probability for a possible value of an opinion is computed as:

E = b + a \cdot u    (3)

The belief b of the opinion represents how much we believe that the statement y is true (where y is, for instance, the fact that an annotation is correct), based on the evidence at our disposal. However, the evidence at our disposal is limited (and since the evidence is obtained by asking an authority, such as a museum, to evaluate an annotation, we aim at keeping it limited in order to avoid overloading the museum with such requests), so we must take into account the fact that other observations about the same fact might, in principle, disagree with those currently at our disposal. This leads to uncertainty, which is represented by the corresponding element of the opinion, u.
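The following is a minimal Python sketch of equations (1)-(3), assuming the positive and negative evidence counts p and n have already been obtained from the museum's evaluations:

def binomial_opinion(p: float, n: float, a: float = 0.5):
    """Build a binomial subjective opinion (b, d, u, a) from
    p positive and n negative pieces of evidence (equations 1-2)."""
    denom = p + n + 2.0
    return p / denom, n / denom, 2.0 / denom, a

def expected_value(b: float, u: float, a: float = 0.5) -> float:
    """Expected probability E = b + a*u (equation 3)."""
    return b + a * u

# Example: 8 accepted and 2 rejected tags give a reputation of
# 8/12 + 0.5 * 2/12 = 0.75
b, d, u, a = binomial_opinion(8, 2)
reputation = expected_value(b, u, a)

Note how, for the same ratio of positive to negative evidence, more observations shrink u and pull E away from the prior a and towards the observed frequency.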

The disbelief d represents, instead, the disbelief that we have about the statement y, based on the actual negative evidence at our disposal. Disbelief is the counterpart of belief, and the sum of belief, disbelief and uncertainty is always one (b + d + u = 1). Whereas belief, disbelief and uncertainty are based on the actual evidence observed, the base rate a encodes the prior knowledge about the truth of y according to x (before observing any evidence). The final opinion about the correctness of y is then obtained by aggregating a and b in a weighted manner, so as to make b count more if it is based on more evidence than a (and thus its uncertainty is lower). This explains the meaning of E, which represents exactly such an aggregation. One last consideration regards the fact that E represents the expected value both of the opinion and of the Beta distribution equivalent to it; u is tightly connected to the variance of the Beta distribution, as they both represent the uncertainty about the belief being significant. These elements make subjective opinions ideal tools for representing the fact that the probabilities we estimate about the trustworthiness of annotations, authors and provenance stereotypes (defined later in this section) are based on evidence provided by a museum (consisting of evaluations based on the museum's policy), and that the estimates we make carry some uncertainty due to the fact that they are based on a limited set of observations.

Semantic similarity

The targets of our trust assessments are annotations (that is, words associated with images in order to describe them), and our evidence consists of evaluated annotations. In our system we collect evidence about an author or a provenance stereotype (which is defined later in this section), consisting of annotations evaluated by the museum, and we compare each new annotation that needs to be evaluated against the evidence at our disposal. If, for instance, we based our estimates only on the ratio between positive and negative evidence for a given author, then all his annotations would be rated equally. If, instead, we compared every new annotation only against the evidence of exactly matching words, we would have a more tailored but more limited estimate, as we cannot expect that the same author always uses the same words (and so that there is evidence for every word he uses in an annotation). In order to increase the availability of evidence for our estimates, and to let the more relevant evidence have a higher impact on the calculations, we employ semantic relatedness measures as a weighing factor. These measures quantify the likeness between the meanings of two given terms. Whenever we evaluate a tag, we take the evidence at our disposal, and tags that are more semantically similar to the one we focus on are weighed more heavily. There exist many techniques for measuring semantic relatedness, which can be divided into two groups. First, there are so-called topological semantic similarity measures, which are deterministic measures based on the graph distance between the two words examined, using a word graph (e.g., WordNet [39]). Second, there is the family of statistical semantic similarity measures, which includes, for instance, the Normalized Google Distance [40], which statistically measures the similarity between two words on the basis of the number of times that they occur and co-occur in documents indexed by Google.
These measures are characterized by the fact that the similarity of two words is estimated on a statistical basis from their occurrence and co-occurrence in large sets of documents.
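As a sketch of the weighting idea described above, the evidence counts fed into a subjective opinion can be made similarity-weighted sums rather than plain counts. Here evaluated_tags is assumed to be a list of (tag, accepted) pairs for one user or stereotype, sim any relatedness function with values in [0, 1], and binomial_opinion the function sketched earlier; this illustrates the weighting principle, not the paper's full algorithm:

def weighted_opinion(new_tag, evaluated_tags, sim):
    """Weigh each piece of evidence by its semantic relatedness to the
    new tag, so that more relevant evidence counts more heavily."""
    p = sum(sim(new_tag, tag) for tag, accepted in evaluated_tags if accepted)
    n = sum(sim(new_tag, tag) for tag, accepted in evaluated_tags if not accepted)
    return binomial_opinion(p, n)

A tag identical to previously accepted tags contributes full positive weight, while an unrelated one contributes almost nothing in either direction, leaving the opinion more uncertain.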

We focus on deterministic semantic relatedness measures based on WordNet or its Dutch counterpart Cornetto [41]. In particular, we use the Wu and Palmer [42] measure and the Lin [43] measure for computing semantic relatedness between tags, because both provide values in the range [0, 1], but other measures are possible as well. WordNet is a directed acyclic graph where each vertex represents a synset (a set of word synonyms), and each directed edge from v to w implies that w is a hypernym of v, that is, w shares a type-of relation with v. For instance, if v is the word winter (hyponym), w can be the word season (hypernym).

The Wu and Palmer measure calculates the semantic relatedness between two words by considering the depths of the two synsets in WordNet, along with the depth of their Least Common Subsumer, as follows:

score(s_1, s_2) = \frac{2 \cdot depth(lcs(s_1, s_2))}{depth(s_1) + depth(s_2)}

where s_1 is a synset of the first word and s_2 a synset of the second. If a synset is a generalization of another one, we can measure the depth, that is, the distance between the two; the first ancestor shared by two nodes is their Least Common Subsumer. We compute the similarity of all synset combinations and pick the maximum value, as we adopt the upper bound of the similarity between the two words.

The Lin measure considers the information content of the Least Common Subsumer and of the two compared synsets, as follows:

sim(s_1, s_2) = \frac{2 \cdot IC(lcs(s_1, s_2))}{IC(s_1) + IC(s_2)}

where IC is the information content, that is, the probability of finding the concept in a given corpus, defined as:

IC(s) = \log \left( \frac{freq(s)}{freq(root)} \right)

where freq is the frequency of the synset. So the Wu and Palmer measure derives the similarity of two concepts from their distance to a common ancestor, while the Lin similarity derives it from the information content of the two concepts and of their lowest common ancestor. The Wu and Palmer similarity measure is more recent and has been shown to combine effectively with subjective logic in Ceolin et al. [44], so when we deal with datasets of tags in English (the Steve.Museum dataset), we use its implementation provided by the Python nltk library [45]. Instead, when we work on datasets composed of Dutch tags (e.g., the dataset from the SEALINC Media project experiment), we rely on pycornetto [46], an interface to Cornetto, the Dutch WordNet. pycornetto does not provide a means to compute the Wu and Palmer similarity measure, but it provides the Lin similarity measure, and given the relatedness of the two measures, in this case we adopt the Lin measure. For more details about how to combine semantic relatedness measures and subjective logic, see the work of Ceolin et al. [44].

By choosing these measures we limit ourselves to evaluating only single-word tags and only common words, because these are the kinds of words that are present in WordNet. However, we choose these measures because almost all the tags we evaluate fall into the mentioned categories, and because the use of these similarity measures together with subjective logic has already been theoretically validated. Moreover, almost all the words used in the annotations that form the datasets used in our evaluations are single-word tags and common words, hence this limitation does not affect our evaluation significantly.
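For English tags, the maximum Wu and Palmer similarity over all synset combinations can be computed with nltk along these lines (a sketch; it assumes the WordNet corpus has been downloaded, e.g. via nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

def wup(word1: str, word2: str) -> float:
    """Maximum Wu-Palmer similarity over all synset pairs of two words."""
    best = 0.0
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            score = s1.wup_similarity(s2)  # None for incomparable synsets
            if score is not None and score > best:
                best = score
    return best

# e.g., wup('winter', 'season') is high; wup('winter', 'vase') is much lower

Taking the maximum over synset pairs implements the upper-bound choice discussed above, which compensates for not knowing a word's intended sense.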

The proposed algorithm is designed so that any other relatedness measure can be used in place of the chosen ones, without the need for any additional intervention.

The choice of the semantic similarity measure, and how the semantic similarity is used, both affect the uncertainty of the expected results of the algorithms that we propose. In fact, these algorithms use semantic similarity to weigh the importance of evidence when evaluating tags, i.e. words associated with cultural heritage artifacts. We use a deterministic semantic similarity measure which, although it constitutes a heuristic, is based on a trustworthy data source (WordNet), which implies that the measure is less uncertain than a probabilistic measure based on a limited document corpus. Still, the semantic similarity measure represents an approximation, and part of the uncertainty it implies is due to how it is used: semantic similarity measures represent the similarity between synsets, but we have at our disposal only words, without an indication of their intended synset (a word may have several meanings, represented by synsets). Since we are situated in a well-defined domain (cultural heritage), and since all words are used to tag cultural heritage artifacts, we assume that the words are semantically related, and hence, when computing the semantic similarity between two words, we use the maximum of the similarities between all their synsets. Despite the fact that this introduces an approximation, we will show in the Results section that the method is effective.

Datasets adopted

We validate the proposed algorithms over two datasets of image annotations. The annotations contained in these datasets consist of content descriptions, and the datasets also contain the evaluations from the institutions that collected them. For each annotation, the datasets provide information about its author and a timestamp. Since each institution adopts a different policy for evaluating annotations, we try to learn such a policy from a sample of annotations per dataset, and to find a relationship between the identity of the author (or other information about the annotation) and its evaluation.

SEALINC Media project experiment

As part of the SEALINC Media project, the Rijksmuseum in Amsterdam [9] is crowdsourcing annotations of artifacts in its collection using Web users. An initial experiment was conducted to study the effect of presenting pre-set tags on the quality of annotations of crowdsourced data [47]. In the experiment, external annotators were presented with pictures from the Web and prints from the Rijksmuseum collection, along with pre-set annotations about the picture or print, and they were asked to insert new annotations or remove the pre-set ones with which they did not agree (the pre-set tags are either correct or not). A total of 2,650 annotations resulted from the experiment, and these were manually evaluated by trusted personnel for their quality and relevance using the following scale:

1: Irrelevant
2: Incorrect
3: Subjective
4: Correct and possibly relevant
5: Correct and highly relevant
typo: Spelling mistake
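The text above does not spell out which levels count as positive evidence for the binomial opinions; assuming, purely for illustration, that levels 4 and 5 count as positive and levels 1 to 3 as negative (with 'typo' discarded, as explained below), the mapping could look as follows:

def sealinc_evidence(evaluation: str):
    """Map a SEALINC evaluation to binomial evidence.

    Assumption for illustration only: 4 and 5 are positive, 1-3 negative;
    'typo' yields None and is excluded from the evidence entirely.
    """
    if evaluation == "typo":
        return None
    return evaluation in ("4", "5")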

These tags, along with their evaluations, were used to validate our model. For each tag, the SEALINC Media dataset presents the following elements: author identifier, artifact identifier, timestamp, evaluation. We do not focus on the goals of the experiment from which this dataset was obtained; that is, we do not analyze the relation between the kind of tag that was proposed to the user and the tag that the user provided. We focus on the tag that the user actually proposes and its evaluation, and we try to predict the evaluation of the tags provided by each user, given a small training set of sample evaluations for each of them. We neglect the tags evaluated as 'typo' because our focus is on the semantic correctness of the tags, so we assume that this category of mistakes would be properly avoided or treated (e.g., by using autocompletion and checking the presence of the tags in dictionaries) before the tags reach our evaluation framework.

We build our training set using a fixed number of evaluated annotations for each of the users, and form the test set using the remaining annotations. The number of annotations used to build the reputation and the percentage of the dataset covered are presented in Table 1: the first column, '# tags per reputation', reports the number of evaluated annotations used to build each reputation, while the second column, '% training set covered', reports the percentage of annotations used as the training set compared to the whole dataset.

Table 1 Results of the evaluation of Algorithm 1 over the SEALINC Media dataset

# Tags per reputation | % Training set covered | Accuracy | Precision | Recall | F-measure | Time (sec.)
5                     | 8%                     |          |           |        |           |
10                    |                        |          |           |        |           |
15                    |                        |          |           |        |           |
20                    |                        |          |           |        |           |

Results of the evaluation of Algorithm 1 over the SEALINC Media dataset for training sets formed by aggregating 5, 10, 15 and 20 evaluated tags per user reputation. We report the percentage of the dataset actually covered by the training set, and the accuracy, precision, recall and F-measure of our prediction.

Steve.Museum project dataset

Steve.Museum [7] is a project involving several museum professionals in the cultural heritage domain. Part of the project focuses on understanding the various effects of crowdsourcing cultural heritage artifact annotations. Its experiments involved external annotators annotating museum collections, and a subset of the data collected from the crowd was evaluated for trustworthiness. In total, 4,588 users tagged 89,671 artifacts from 21 participating museums, using 480,617 tags. A part of these annotations, consisting of 45,860 tags, was manually evaluated by professionals at these museums and used as the basis for our second case study. In this project, the annotations were classified in a more refined way than in the previous case study, namely as: {Todo, Judgement-negative, Judgement-positive, Problematic-foreign, Problematic-huh, Problematic-misperception, Problematic-misspelling, Problematic-no_consensus, Problematic-personal, Usefulness-not_useful, Usefulness-useful}. There are three main categories: judgement (a personal judgement by the annotator about the picture), problematic (for several different reasons) and usefulness (stating whether the annotation is useful or not). We consider only usefulness-useful as a positive judgement; all the others are considered negative evaluations. The tags classified as todo are discarded, since their evaluation has not yet been performed.
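In both case studies the training set is built from a fixed number of evaluated annotations per user, with the remainder forming the test set. A sketch of this split, assuming each annotation is a (user, tag, timestamp, evaluation) tuple; taking the first k annotations chronologically is our own assumption, as the extract does not state the selection order:

from collections import defaultdict

def split_per_user(annotations, k):
    """First k annotations of each user form the training set;
    the rest form the test set."""
    by_user = defaultdict(list)
    for user, tag, timestamp, evaluation in annotations:
        by_user[user].append((user, tag, timestamp, evaluation))
    train, test = [], []
    for anns in by_user.values():
        anns.sort(key=lambda a: a[2])  # chronological order (assumed)
        train.extend(anns[:k])
        test.extend(anns[k:])
    return train, test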

The Steve.Museum dataset is provided as a MySQL database and consists of several tables. The most important ones for us are: steve_term, which contains information such as the identifiers of the annotated artifacts and the words associated with them (tags); steve_session, which reports information about when the tags were provided and by whom; and steve_term_review, which contains information about the tag evaluations. We join these tables and select the information that is relevant for us: the tags, their authors, their timestamps (i.e. date and time of creation) and their evaluations. We partition the dataset thus obtained into a training set and a test set, as shown in Table 2, along with their percentage coverage of the whole dataset (in the second column, '% training set covered') and the obtained results. We use the training set to learn a model for evaluating user-provided tags, and we consider the tags in the test set as tags newly introduced by the authors for which we build a model.

Table 2 Results of the evaluation of Algorithm 1 over the Steve.Museum dataset

# Tags per reputation | % Training set covered | Accuracy | Precision | Recall | F-measure | Time (sec.)
5                     | 18%                    |          |           |        |           |
10                    |                        |          |           |        |           |
15                    |                        |          |           |        |           |
20                    |                        |          |           |        |           |
25                    |                        |          |           |        |           |
30                    |                        |          |           |        |           |

Results of the evaluation of Algorithm 1 over the Steve.Museum dataset for training sets formed by aggregating 5, 10, 15, 20, 25 and 30 evaluated tags per user reputation. We report the percentage of the dataset actually covered by the training set, and the accuracy, precision, recall and F-measure of our prediction.

System description

Having introduced the methods that we make use of and the datasets that we analyze, we provide here a description of the system that we propose.

High-level system overview

The system that we propose aims at relieving the institution's personnel (reviewers in particular) from the burden of controlling and evaluating all the annotations inserted by users. The system asks for some interaction with the reviewers, but tries to minimize it. Figure 2 shows a high-level view of the model. For each user, the system asks the reviewers to review a fixed number of annotations, and on the basis of these reviews it builds user reputations. A reputation is meant to express a global measure of trustworthiness and accountability of the corresponding user. The reviews are also used to assess the trustworthiness of each tag inserted afterwards by a user: given a tag, the system evaluates it by looking at the evaluations already available, where the evaluations of the tags semantically closer to the one being evaluated have a higher impact. So we have two distinct phases: a first training step where we collect samples of manual reviews, and a second step where we make automatic assessments of tag trustworthiness (possibly after having clustered the evaluated tags, to improve the computation time). The more reviews there are, the more reliable the reputation is, but this number also depends on the workforce at the disposal of the institution. On the other hand, as we will see in the following section, this parameter does not significantly affect the accuracy obtained.

Moreover, we do not need to set an acceptance threshold (e.g., accept only annotations with a trust value of, say, at least 0.9, for trust values ranging from zero to one), in contrast to the work of Ceolin et al. [21]. This is important since such a threshold is arbitrary, and it is not trivial to find a balance between the risk of accepting wrong annotations and that of rejecting good ones.

[Figure 2: High-level overview of the system. We evaluate a subset of user-contributed tags and use it to build the annotator reputation model. We evaluate the remaining set of the tags based on semantic similarity measures and subjective logic. The unevaluated tags are then ranked based on the annotator reputation model, and the model serves as a basis to accept or reject the unevaluated tags from the users.]

We introduce here a running example that accompanies the description of the system in the rest of this section. Suppose that a user, Alex (whose profile already contains three tags which were evaluated by the museum), newly contributes to the collection of the Fictitious National Museum by tagging five artifacts. Alex tags one artifact with Chinese. If the museum immediately uses the tag for classifying the artifact, this might be risky because the tag might be wrong (maliciously or not). On the other hand, had the museum enough internal employees to check the externally contributed tag, it would not have needed to crowdsource it. The system that we propose relies on a few evaluations of Alex's tags by the museum. Based on these evaluations, the system: (1) computes Alex's reputation; (2) computes a trust value for the new tag; and (3) decides whether to accept it or not. We describe the system implementation in the following sections.

Annotation representation

We adopt the Open Annotation model [48] as a standard model for describing annotations, together with the most relevant related metadata (such as the author and the time of creation). The Open Annotation model allows us to reify the annotation itself, and by treating it as an object, we can easily link to it properties like the annotator URI or the time of creation. Moreover, the review of an annotation can itself be represented as an annotation whose target is the original annotation and whose body contains the value of the review. To continue with our example, Figure 3 and Listing 1 show an example of an annotation and a corresponding review, both represented as annotations from the Open Annotation model.

Listing 1 Example of an annotation and the corresponding evaluation. The annotation is represented using the Annotation class from the Open Annotation model. The evaluation is represented as an annotation of the annotation. (The prefix declarations were garbled in the source; the rdf, oac and foaf namespaces below are the standard ones, while ex and tag are example namespaces.)

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix oac:  <http://www.openannotation.org/ns/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/> .      # example namespace
@prefix tag:  <http://example.org/tag/> .  # example namespace

ex:user_1 rdf:type oac:Annotator ;
    foaf:givenName "Alex" .

ex:annotation_1 rdf:type oac:Annotation ;
    oac:hasBody tag:Chinese ;
    oac:annotator ex:user_1 ;
    oac:hasTarget ex:img_231 .

ex:review_1 rdf:type oac:Annotation ;
    oac:hasBody ex:ann_accepted ;
    oac:annotator ex:reviewer_1 ;
    oac:hasTarget ex:annotation_1 .

ex:ann_accepted oac:annotates ex:annotation_1 .

Trust management

We employ subjective logic [49] for representing, computing and reasoning on trust assessments. There are several reasons why we use this logic. First, it allows us to quantify the truth of statements regarding different subjects (e.g. user reputation and tag trust value) by aggregating the evidence at our disposal in a simple and clear way that accounts both for the distribution of the observed evidence and for its size, hence quantifying the uncertainty of our assessment. Second, each statement in subjective logic is equivalent to a Beta or Dirichlet probability distribution, and hence we can tackle the problem from a statistical point of view without the need to change our data representation. Third, the logic offers several operators to combine the assessments made over the statements of our interest. We have made limited use of operators so far, but we aim at expanding this in the near future. Lastly, we use subjective logic because it allows us to represent formally the fact that the evidence we collect is linked to a given subject (user, tag) and is based on a specific point of view (the reviewers of a museum) that is the source of the evaluations. Trust is context-dependent, since different users or tags (or, more generally, agents and artifacts) might receive different trust evaluations depending on the context in which they are situated and on the reviewer. In our scenarios we do not have at our disposal an explicit description of the museums' trust policies. Also, we do not aim at determining a generic tag (or user) trust level. Our goal is to learn a model that evaluates tags as closely as possible to the way the museum's own reviewers would evaluate them.

[Figure 3: Representation of the annotations and their reviews. We represent annotations and their reviews as annotations from the Open Annotation model.]
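Putting the pieces together for the running example, the threshold-free acceptance step described earlier (rank a user's new tags by trust value, then accept a share proportional to the reputation) might be sketched as follows. Here trust_value stands for a per-tag scoring function such as the similarity-weighted expected value sketched earlier, and the use of rounding is our own assumption:

def accept_tags(new_tags, reputation, trust_value):
    """Rank new tags by their trust value and accept a number of them
    proportional to the user's (or stereotype's) reputation,
    avoiding any fixed acceptance threshold."""
    ranked = sorted(new_tags, key=trust_value, reverse=True)
    n_accept = round(reputation * len(ranked))
    return ranked[:n_accept], ranked[n_accept:]

# e.g., with a reputation of 0.75 and five new tags from Alex, the four
# highest-ranked tags are accepted and the remaining one is rejected.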


More information

Retrieval Evaluation. Hongning Wang

Retrieval Evaluation. Hongning Wang Retrieval Evaluation Hongning Wang CS@UVa What we have learned so far Indexed corpus Crawler Ranking procedure Research attention Doc Analyzer Doc Rep (Index) Query Rep Feedback (Query) Evaluation User

More information

Enhanced retrieval using semantic technologies:

Enhanced retrieval using semantic technologies: Enhanced retrieval using semantic technologies: Ontology based retrieval as a new search paradigm? - Considerations based on new projects at the Bavarian State Library Dr. Berthold Gillitzer 28. Mai 2008

More information

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks Archana Sulebele, Usha Prabhu, William Yang (Group 29) Keywords: Link Prediction, Review Networks, Adamic/Adar,

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

InfoQ: Computational Assessment of Information Quality on the Web. Centrum voor Wiskunde en Informatica (CWI) Vrije Universiteit Amsterdam

InfoQ: Computational Assessment of Information Quality on the Web. Centrum voor Wiskunde en Informatica (CWI) Vrije Universiteit Amsterdam InfoQ: Computational Assessment of Information Quality on the Web Davide Ceolin 1 *, Chantal van Son 2, Lora Aroyo 2, Julia Noordegraaf 3, Ozkan Sener 2, Robin Sharma 2, Piek Vossen 2, Serge ter Braake

More information

CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION

CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION 5.1 INTRODUCTION Generally, deployment of Wireless Sensor Network (WSN) is based on a many

More information

Detecting Controversial Articles in Wikipedia

Detecting Controversial Articles in Wikipedia Detecting Controversial Articles in Wikipedia Joy Lind Department of Mathematics University of Sioux Falls Sioux Falls, SD 57105 Darren A. Narayan School of Mathematical Sciences Rochester Institute of

More information

Presented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu

Presented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu Presented by: Dimitri Galmanovich Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu 1 When looking for Unstructured data 2 Millions of such queries every day

More information

3.4 Data-Centric workflow

3.4 Data-Centric workflow 3.4 Data-Centric workflow One of the most important activities in a S-DWH environment is represented by data integration of different and heterogeneous sources. The process of extract, transform, and load

More information

Edge Classification in Networks

Edge Classification in Networks Charu C. Aggarwal, Peixiang Zhao, and Gewen He Florida State University IBM T J Watson Research Center Edge Classification in Networks ICDE Conference, 2016 Introduction We consider in this paper the edge

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

COMP90042 LECTURE 3 LEXICAL SEMANTICS COPYRIGHT 2018, THE UNIVERSITY OF MELBOURNE

COMP90042 LECTURE 3 LEXICAL SEMANTICS COPYRIGHT 2018, THE UNIVERSITY OF MELBOURNE COMP90042 LECTURE 3 LEXICAL SEMANTICS SENTIMENT ANALYSIS REVISITED 2 Bag of words, knn classifier. Training data: This is a good movie.! This is a great movie.! This is a terrible film. " This is a wonderful

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

Ontology Matching with CIDER: Evaluation Report for the OAEI 2008

Ontology Matching with CIDER: Evaluation Report for the OAEI 2008 Ontology Matching with CIDER: Evaluation Report for the OAEI 2008 Jorge Gracia, Eduardo Mena IIS Department, University of Zaragoza, Spain {jogracia,emena}@unizar.es Abstract. Ontology matching, the task

More information

MIA - Master on Artificial Intelligence

MIA - Master on Artificial Intelligence MIA - Master on Artificial Intelligence 1 Hierarchical Non-hierarchical Evaluation 1 Hierarchical Non-hierarchical Evaluation The Concept of, proximity, affinity, distance, difference, divergence We use

More information

The Trustworthiness of Digital Records

The Trustworthiness of Digital Records The Trustworthiness of Digital Records International Congress on Digital Records Preservation Beijing, China 16 April 2010 1 The Concept of Record Record: any document made or received by a physical or

More information

Social Navigation Support through Annotation-Based Group Modeling

Social Navigation Support through Annotation-Based Group Modeling Social Navigation Support through Annotation-Based Group Modeling Rosta Farzan 2 and Peter Brusilovsky 1,2 1 School of Information Sciences and 2 Intelligent Systems Program University of Pittsburgh, Pittsburgh

More information

A Semantic Role Repository Linking FrameNet and WordNet

A Semantic Role Repository Linking FrameNet and WordNet A Semantic Role Repository Linking FrameNet and WordNet Volha Bryl, Irina Sergienya, Sara Tonelli, Claudio Giuliano {bryl,sergienya,satonelli,giuliano}@fbk.eu Fondazione Bruno Kessler, Trento, Italy Abstract

More information

Graphical Models. David M. Blei Columbia University. September 17, 2014

Graphical Models. David M. Blei Columbia University. September 17, 2014 Graphical Models David M. Blei Columbia University September 17, 2014 These lecture notes follow the ideas in Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. In addition,

More information

Topic 7 Machine learning

Topic 7 Machine learning CSE 103: Probability and statistics Winter 2010 Topic 7 Machine learning 7.1 Nearest neighbor classification 7.1.1 Digit recognition Countless pieces of mail pass through the postal service daily. A key

More information

EXAM PREPARATION GUIDE

EXAM PREPARATION GUIDE When Recognition Matters EXAM PREPARATION GUIDE PECB Certified ISO 9001 Lead Auditor www.pecb.com The objective of the PECB Certified ISO 9001 Lead Auditor examination is to ensure that the candidate possesses

More information

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction Adaptable and Adaptive Web Information Systems School of Computer Science and Information Systems Birkbeck College University of London Lecture 1: Introduction George Magoulas gmagoulas@dcs.bbk.ac.uk October

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University September 30, 2016 1 Introduction (These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan.

More information

NATURAL LANGUAGE PROCESSING

NATURAL LANGUAGE PROCESSING NATURAL LANGUAGE PROCESSING LESSON 9 : SEMANTIC SIMILARITY OUTLINE Semantic Relations Semantic Similarity Levels Sense Level Word Level Text Level WordNet-based Similarity Methods Hybrid Methods Similarity

More information

Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach

Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach Abstract Automatic linguistic indexing of pictures is an important but highly challenging problem for researchers in content-based

More information

INFORMATION RETRIEVAL SYSTEM USING FUZZY SET THEORY - THE BASIC CONCEPT

INFORMATION RETRIEVAL SYSTEM USING FUZZY SET THEORY - THE BASIC CONCEPT ABSTRACT INFORMATION RETRIEVAL SYSTEM USING FUZZY SET THEORY - THE BASIC CONCEPT BHASKAR KARN Assistant Professor Department of MIS Birla Institute of Technology Mesra, Ranchi The paper presents the basic

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

A Recommendation Model Based on Site Semantics and Usage Mining

A Recommendation Model Based on Site Semantics and Usage Mining A Recommendation Model Based on Site Semantics and Usage Mining Sofia Stamou Lefteris Kozanidis Paraskevi Tzekou Nikos Zotos Computer Engineering and Informatics Department, Patras University 26500 GREECE

More information

Application of Individualized Service System for Scientific and Technical Literature In Colleges and Universities

Application of Individualized Service System for Scientific and Technical Literature In Colleges and Universities Journal of Applied Science and Engineering Innovation, Vol.6, No.1, 2019, pp.26-30 ISSN (Print): 2331-9062 ISSN (Online): 2331-9070 Application of Individualized Service System for Scientific and Technical

More information

Trustworthiness of Data on the Web

Trustworthiness of Data on the Web Trustworthiness of Data on the Web Olaf Hartig Humboldt-Universität zu Berlin Department of Computer Science hartig@informatik.hu-berlin.de Abstract: We aim for an evolution of the Web of data to a Web

More information

Document Retrieval using Predication Similarity

Document Retrieval using Predication Similarity Document Retrieval using Predication Similarity Kalpa Gunaratna 1 Kno.e.sis Center, Wright State University, Dayton, OH 45435 USA kalpa@knoesis.org Abstract. Document retrieval has been an important research

More information

EXAM PREPARATION GUIDE

EXAM PREPARATION GUIDE When Recognition Matters EXAM PREPARATION GUIDE PECB Certified ISO/IEC 20000 Lead Auditor www.pecb.com The objective of the Certified ISO/IEC 20000 Lead Auditor examination is to ensure that the candidate

More information

Crowdsourcing Annotations for Visual Object Detection

Crowdsourcing Annotations for Visual Object Detection Crowdsourcing Annotations for Visual Object Detection Hao Su, Jia Deng, Li Fei-Fei Computer Science Department, Stanford University Abstract A large number of images with ground truth object bounding boxes

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

Context Change and Versatile Models in Machine Learning

Context Change and Versatile Models in Machine Learning Context Change and Versatile s in Machine Learning José Hernández-Orallo Universitat Politècnica de València jorallo@dsic.upv.es ECML Workshop on Learning over Multiple Contexts Nancy, 19 September 2014

More information

Semantic Web. Ontology Engineering and Evaluation. Morteza Amini. Sharif University of Technology Fall 93-94

Semantic Web. Ontology Engineering and Evaluation. Morteza Amini. Sharif University of Technology Fall 93-94 ه عا ی Semantic Web Ontology Engineering and Evaluation Morteza Amini Sharif University of Technology Fall 93-94 Outline Ontology Engineering Class and Class Hierarchy Ontology Evaluation 2 Outline Ontology

More information

Text Document Clustering Using DPM with Concept and Feature Analysis

Text Document Clustering Using DPM with Concept and Feature Analysis Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,

More information

Semantically Driven Snippet Selection for Supporting Focused Web Searches

Semantically Driven Snippet Selection for Supporting Focused Web Searches Semantically Driven Snippet Selection for Supporting Focused Web Searches IRAKLIS VARLAMIS Harokopio University of Athens Department of Informatics and Telematics, 89, Harokopou Street, 176 71, Athens,

More information

REENGINEERING SYSTEM

REENGINEERING SYSTEM GENERAL OBJECTIVES OF THE SUBJECT At the end of the course, Individuals will analyze the characteristics of the materials in the industries, taking into account its advantages and functionality, for their

More information

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE 15 : CONCEPT AND SCOPE 15.1 INTRODUCTION Information is communicated or received knowledge concerning a particular fact or circumstance. Retrieval refers to searching through stored information to find

More information

Chapter 3: Google Penguin, Panda, & Hummingbird

Chapter 3: Google Penguin, Panda, & Hummingbird Chapter 3: Google Penguin, Panda, & Hummingbird Search engine algorithms are based on a simple premise: searchers want an answer to their queries. For any search, there are hundreds or thousands of sites

More information

Annotation Component in KiWi

Annotation Component in KiWi Annotation Component in KiWi Marek Schmidt and Pavel Smrž Faculty of Information Technology Brno University of Technology Božetěchova 2, 612 66 Brno, Czech Republic E-mail: {ischmidt,smrz}@fit.vutbr.cz

More information

A Session-based Ontology Alignment Approach for Aligning Large Ontologies

A Session-based Ontology Alignment Approach for Aligning Large Ontologies Undefined 1 (2009) 1 5 1 IOS Press A Session-based Ontology Alignment Approach for Aligning Large Ontologies Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University,

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Estimating the Quality of Databases

Estimating the Quality of Databases Estimating the Quality of Databases Ami Motro Igor Rakov George Mason University May 1998 1 Outline: 1. Introduction 2. Simple quality estimation 3. Refined quality estimation 4. Computing the quality

More information

ANNUAL REPORT Visit us at project.eu Supported by. Mission

ANNUAL REPORT Visit us at   project.eu Supported by. Mission Mission ANNUAL REPORT 2011 The Web has proved to be an unprecedented success for facilitating the publication, use and exchange of information, at planetary scale, on virtually every topic, and representing

More information

CLUSTER ANALYSIS APPLIED TO EUROPEANA DATA

CLUSTER ANALYSIS APPLIED TO EUROPEANA DATA CLUSTER ANALYSIS APPLIED TO EUROPEANA DATA by Esra Atescelik In partial fulfillment of the requirements for the degree of Master of Computer Science Department of Computer Science VU University Amsterdam

More information

Wireless Network Security : Spring Arjun Athreya March 3, 2011 Survey: Trust Evaluation

Wireless Network Security : Spring Arjun Athreya March 3, 2011 Survey: Trust Evaluation Wireless Network Security 18-639: Spring 2011 Arjun Athreya March 3, 2011 Survey: Trust Evaluation A scenario LOBOS Management Co A CMU grad student new to Pittsburgh is looking for housing options in

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Challenges of Analyzing Parametric CFD Results. White Paper Published: January

Challenges of Analyzing Parametric CFD Results. White Paper Published: January Challenges of Analyzing Parametric CFD Results White Paper Published: January 2011 www.tecplot.com Contents Introduction... 3 Parametric CFD Analysis: A Methodology Poised for Growth... 4 Challenges of

More information

Semantic Web. Ontology Engineering and Evaluation. Morteza Amini. Sharif University of Technology Fall 95-96

Semantic Web. Ontology Engineering and Evaluation. Morteza Amini. Sharif University of Technology Fall 95-96 ه عا ی Semantic Web Ontology Engineering and Evaluation Morteza Amini Sharif University of Technology Fall 95-96 Outline Ontology Engineering Class and Class Hierarchy Ontology Evaluation 2 Outline Ontology

More information

Pilot: Collaborative Glossary Platform

Pilot: Collaborative Glossary Platform Pilot: Collaborative Glossary Platform CI Context Viewpoint 1. As-Is Workflow Model Workflow Name: Term / Definition Collection Domain Context: Typically each group / community has its own list of terms

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

A Generalized Method to Solve Text-Based CAPTCHAs

A Generalized Method to Solve Text-Based CAPTCHAs A Generalized Method to Solve Text-Based CAPTCHAs Jason Ma, Bilal Badaoui, Emile Chamoun December 11, 2009 1 Abstract We present work in progress on the automated solving of text-based CAPTCHAs. Our method

More information

Bayesian analysis of genetic population structure using BAPS: Exercises

Bayesian analysis of genetic population structure using BAPS: Exercises Bayesian analysis of genetic population structure using BAPS: Exercises p S u k S u p u,s S, Jukka Corander Department of Mathematics, Åbo Akademi University, Finland Exercise 1: Clustering of groups of

More information

Chapter 2 Overview of the Design Methodology

Chapter 2 Overview of the Design Methodology Chapter 2 Overview of the Design Methodology This chapter presents an overview of the design methodology which is developed in this thesis, by identifying global abstraction levels at which a distributed

More information

RETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2

RETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2 Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2014, 6, 1907-1911 1907 Web-Based Data Mining in System Design and Implementation Open Access Jianhu

More information

ALIN Results for OAEI 2016

ALIN Results for OAEI 2016 ALIN Results for OAEI 2016 Jomar da Silva, Fernanda Araujo Baião and Kate Revoredo Department of Applied Informatics Federal University of the State of Rio de Janeiro (UNIRIO), Rio de Janeiro, Brazil {jomar.silva,fernanda.baiao,katerevoredo}@uniriotec.br

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

Overview of Akamai s Personal Data Processing Activities and Role

Overview of Akamai s Personal Data Processing Activities and Role Overview of Akamai s Personal Data Processing Activities and Role Last Updated: April 2018 This document is maintained by the Akamai Global Data Protection Office 1 Introduction Akamai is a global leader

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

What is this Song About?: Identification of Keywords in Bollywood Lyrics

What is this Song About?: Identification of Keywords in Bollywood Lyrics What is this Song About?: Identification of Keywords in Bollywood Lyrics by Drushti Apoorva G, Kritik Mathur, Priyansh Agrawal, Radhika Mamidi in 19th International Conference on Computational Linguistics

More information

Linking Entities in Chinese Queries to Knowledge Graph

Linking Entities in Chinese Queries to Knowledge Graph Linking Entities in Chinese Queries to Knowledge Graph Jun Li 1, Jinxian Pan 2, Chen Ye 1, Yong Huang 1, Danlu Wen 1, and Zhichun Wang 1(B) 1 Beijing Normal University, Beijing, China zcwang@bnu.edu.cn

More information

EXAM PREPARATION GUIDE

EXAM PREPARATION GUIDE When Recognition Matters EXAM PREPARATION GUIDE PECB Certified Management System Auditor www.pecb.com The objective of the PECB Certified Management System Auditor examination is to ensure that the candidates

More information

EXAM PREPARATION GUIDE

EXAM PREPARATION GUIDE EXAM PREPARATION GUIDE PECB Certified ISO 50001 Lead Auditor The objective of the PECB Certified ISO 50001 Lead Auditor examination is to ensure that the candidate has the knowledge and skills to plan

More information

CS 224W Final Report Group 37

CS 224W Final Report Group 37 1 Introduction CS 224W Final Report Group 37 Aaron B. Adcock Milinda Lakkam Justin Meyer Much of the current research is being done on social networks, where the cost of an edge is almost nothing; the

More information

A NEURAL NETWORK BASED TRAFFIC-FLOW PREDICTION MODEL. Bosnia Herzegovina. Denizli 20070, Turkey. Buyukcekmece, Istanbul, Turkey

A NEURAL NETWORK BASED TRAFFIC-FLOW PREDICTION MODEL. Bosnia Herzegovina. Denizli 20070, Turkey. Buyukcekmece, Istanbul, Turkey Mathematical and Computational Applications, Vol. 15, No. 2, pp. 269-278, 2010. Association for Scientific Research A NEURAL NETWORK BASED TRAFFIC-FLOW PREDICTION MODEL B. Gültekin Çetiner 1, Murat Sari

More information

Problems in Reputation based Methods in P2P Networks

Problems in Reputation based Methods in P2P Networks WDS'08 Proceedings of Contributed Papers, Part I, 235 239, 2008. ISBN 978-80-7378-065-4 MATFYZPRESS Problems in Reputation based Methods in P2P Networks M. Novotný Charles University, Faculty of Mathematics

More information

(Milano Lugano Evaluation Method) A systematic approach to usability evaluation. Part I LAB HCI prof. Garzotto - HOC-POLITECNICO DI MILANO a.a.

(Milano Lugano Evaluation Method) A systematic approach to usability evaluation. Part I LAB HCI prof. Garzotto - HOC-POLITECNICO DI MILANO a.a. (Milano Lugano Evaluation Method) A systematic approach to usability evaluation Part I LAB HCI prof. Garzotto - HOC-POLITECNICO DI MILANO a.a. 04-05 1 ABOUT USABILITY 1. WHAT IS USABILITY? Ë Usability

More information