
PROJET DATALIFT
De la donnée brute publiée vers la donnée sémantique interconnectée
FROM RAW PUBLISHED DATA TO INTERLINKED SEMANTIC DATA

Authors: Serena Villata, Fabien Gandon
Reviewers: Thomas Francart, Charles Nepote
Reference: D3.5
Version: 1.1
Destination: Public
ANR CONTINT 2010 call, ANR-10-CORD-009
Research report, 8/08/2011

STATE OF THE ART ON DATA PROVENANCE AND DATA LICENSING

Authors: Serena Villata, INRIA; Fabien Gandon, INRIA
Reviewers: Thomas Francart, Mondeca; Charles Nepote, Fing

DataLift: From Raw Published Data To Interlinked Semantic Data

Contents

Introduction
    Data Provenance
    Data Licensing
    Organization of the Report
Data Provenance
    Dublin Core
    Semantic Web Publishing
    Open Provenance Model
    The FOAF Vocabulary
    SWAN-SIOC Alignment
    The Provenir Ontology
    The Provenance Vocabulary
    The Changeset Vocabulary
    Data Catalog Vocabulary
    RDF Coloring
    Provenance Working Group
Data Licensing
    Creative Commons
    Open Data Commons
    MPEG-21 REL
    Waiver Vocabulary
    Related Vocabularies
    Recent Developments in Access Control for the Web and Linked Data
        Description Logic for Access Control
        Context-dependent Access Control
        The AMO Ontology
        Access Policies for Linked Data
Discussion and Perspective
Bibliography

Introduction

This report is deliverable D3.5 of the DataLift project (ANR-10-CORD-009) and proposes a state of the art on data provenance and data licensing. In recent years, numerous XML languages and, more generally, semantic approaches to security have been developed. We start by providing the reader with an overview of a selection of these approaches, which is useful background before starting the discussion of data provenance and data licensing.

The Platform for Internet Content Selection (PICS) [44] was developed by a cross-industry working group whose goal was to facilitate the development of technologies giving users of interactive media such as the Internet control over the kinds of material to which they and their children have access. The idea is that individuals, groups and businesses should have easy access to the widest possible range of content selection products. In order to advance these goals, PICS devised a set of standards that facilitate the following, as reported in the PICS Statement of Principles: (i) Self-rating: enable content providers to voluntarily label the content they create and distribute; (ii) Third-party rating: enable multiple, independent labelling services to associate additional labels with content created and distributed by others (services may devise their own labelling systems, and the same content may receive different labels from different services); and (iii) Ease-of-use: enable parents and teachers to use ratings and labels from a diversity of sources to control the information that children under their supervision receive. PICS has since been superseded by the Protocol for Web Description Resources (POWDER) [43]. As explained in the W3C Working Draft, the aim of this protocol is to provide a means for individuals or organizations to describe a group of resources through the publication of machine-readable metadata. It defines terms designed to aid the building of trust. In particular, it defines the authenticate property, whose domain covers dcterms:Agent and foaf:Agent. It is part of the description of an entity that creates POWDER documents, not of the POWDER documents themselves. The range of authenticate is simply rdfs:Resource, so its value can be machine-readable, or a human-readable document describing any authentication services that the Description Resources author offers. Trust in Description Resources can be enhanced through third-party certification. An entity may publish data, identifiable through its origin or through a digital signature, stating that the entity certifies that the assertions made in an identified POWDER document are correct. Given this kind of certification, the level of trust the user places in the data becomes equivalent to the level of trust held in the certifying entity.

XML Encryption [45] specifies a process for encrypting data and representing the result in XML. The data may be arbitrary data (including an XML document), an XML element, or XML element content. The result of an XML Encryption is an EncryptedData element that either contains the cipher data (via the content of one of its children) or identifies it (via a URI reference). When encrypting an XML element, the EncryptedData element replaces that element in the encrypted version of the XML document. When encrypting arbitrary data, including entire XML documents, the EncryptedData element may become the root of a new XML document, or a child element in an application-chosen XML document.

The Web Security Context (WSC) [46] provides user interfaces that help users make trust decisions on the Web.
The W3C Working Group Note about WSC presents the following use case: Web user agents are used to engage in a great variety and number of commercial and personal activities. Though the medium for these activities has changed, the potential for fraud has not. The aim of WSC is to specify a baseline set of security context information that should be accessible to Web users, and practices for the secure presentation of this information, to enable users to come to a better understanding of the context in which they are operating when making trust decisions on the Web.

Finally, Web Access Control (WAC) [47] is a decentralized system for allowing different users various forms of access to resources, where the users are identified by URIs. The system is similar to those used by file systems like UNIX, except that the resources, the users and the groups are all identified by URIs. In particular, users are identified by WebIDs. The data provider can grant access to a document on one site to users and groups hosted by other sites. Users do not need to have a profile on a site to access the documents on it. A common ontology has been proposed which provides the terms necessary for access control lists to be stored. In this ontology, an Authorization is an abstract thing whose properties are defined in an Access Control List. The ontology defines four modes of access: Read (read the contents), Write (modify the contents), Control (set the access control list), and Append (add information, but not delete it).

The following terms will be used in the remainder of the report, with the intended meanings we assume.

Copyright: a set of exclusive rights (to copy, distribute and adapt) granted by a state to the creator of an original work, or their assignee, for a limited period of time in exchange for public disclosure of the work.

Digital signature: a mathematical scheme for demonstrating the authenticity of a digital message or document; it gives a recipient reason to believe that the message was created by a known sender and was not altered in transit.

License: granted by a licensor to a licensee as an element of an agreement between those parties. It is an authorization by the licensor for the licensee to use the licensed material. It allows an activity that would otherwise be forbidden.
Policy: a principle or rule to guide decisions. Policies differ from law: while law can compel or prohibit behaviours, policies guide actions toward those most likely to achieve a desired outcome.

Waiver: the voluntary relinquishment or surrender of some known right or privilege. In the legal field, to waive means to voluntarily give up a right or privilege to something.

Warrant: a specific type of authorization.

Data Provenance

The provenance of information is crucial for deciding whether information can be trusted, how to integrate diverse information sources, and how far to trust data providers when reusing information. In an open environment such as the Web, users have to deal with information that is often contradictory or questionable. In particular, with the arrival of the Web of Data [35] (e.g., via the Linked Open Data initiative), the provenance of that data becomes an important factor for developing new Semantic Web applications. We can underline two important issues: 1. choosing a provenance data model that is practical and easy to use, and 2. providing an explicit representation of provenance that is machine-readable. Zhao et al. [12] suggest that a first choice would be to focus specifically on representing the information using the RDF model. The uniform representation of RDF is appealing for a number of reasons. Among others, it can be used for extending the inference capabilities of Semantic Web reasoners, and for accessing large RDF knowledge bases. However, this uniform representation of data and metadata requires capabilities that the basic RDF model does not offer [12].

An issue related to provenance information is delegation. Delegation is the assignment of authority and responsibility to another person to carry out specific activities. In the Semantic Web, an information provider may decide to delegate some authority to another provider, or to a data consumer. A policy language can be used to manage and delegate trust and access rights according to the user's profile. For instance, Hu [49] presents a rights delegation ontology. A deeper analysis of delegation in the context of the Semantic Web is out of the scope of this report.

Buneman et al. [4] define data provenance as the description of the origins of a piece of data, and the process by which it arrives in a database. In the Web of Data, the aim of the Linked Open Data community is to publish information on the Semantic Web. This scenario involves two kinds of individuals: the information providers, who publish information together with meta-information about its intended status, and the information consumers, who have to decide whether to accept the information provided by the data providers [11]. The providers can also publish further information about themselves, and may decide to digitally sign the information they publish. It is evident from this scenario that the representation of data provenance is of great importance in the Web of Data.

Zhao et al. [12] provide in a report an overview of requirements for provenance. The report uses extensive examples in three diverse scenarios that illustrate the need for provenance: 1. a news aggregator site that assembles news items from a variety of sources (e.g., blogs and tweets), where provenance records can help with verification, credit, and licensing; 2. a data integration and analysis activity for studying the spread of a disease, involving public policy and scientific research, where provenance records support combining data from very diverse sources, justification of claims and analytic results, and documentation of data analysis processes for validation and reuse; and 3.
a commercial company that requires provenance information about its software development and testing procedure in order to defend the validity of its contract execution.

Zhao et al. [12] group the requirements for provenance into three major areas of concern: content, management, and use. In this document, we summarize the main issues they address, highlighting the requirements relevant to the RDF model. For more details, see [12],[13].

The content area refers to the type of information that provenance records need to contain. This content may include the entities and processes that contributed to a resource's creation. It may also include argumentations, design choices, and justifications for decisions. The requirements identified by Zhao et al. [12] for content are the following:

Requirement 1: Identity - The ability to refer to the artifact whose provenance we are describing. Within the RDF context, the artifact could be a single RDF statement, a set of statements, or an arbitrary set of Web resources.

Requirement 2: Evolution - The ability to describe the provenance of a dynamic, evolving resource. Over time, there may be updates, and even new versions that change some aspect of the resource. A challenge is to describe how the new versions of the resource relate to one another, and to determine whether provenance records should be self-contained and attached to each version.

Requirement 3: Entailment - The ability to distinguish what is directly asserted by the entities and processes that produce the resource from other information that may be inferred from those assertions, or perhaps derived by a third party.

The management area refers to the mechanisms that make provenance information available and accessible in a system like the Web. This includes the representation language for provenance records, and the methods for publication, dissemination, accessibility and querying of provenance records. Zhao et al. [12] identify the following requirements concerning management:

Requirement 4: Publication - A publisher of provenance information has to use a provenance representation language, and link the provenance assertions to the actual resource information.
The information publisher may choose to publish only a subset of the provenance records, and should be able to identify themselves, for instance with a signature verifiable by others.

Requirement 5: Querying - Provenance information should be accessible in some way, and there must be mechanisms to find the provenance for a given resource. Query formulation and execution must be provided for provenance information.

Finally, the use area mentioned by Zhao et al. [12] refers to the purpose and usage of provenance information. This includes presentation and visualization of provenance information, supporting abstraction and customization, integration of provenance from heterogeneous systems, allowing trust judgments based on provenance information, and handling imperfections in provenance records.

In the Semantic Web field, recent years have seen many efforts to support the provenance requirements described above. For instance, extensions of the existing RDF data model, alternative models of RDF, and vocabularies for describing evolution, versioning, and annotations have been proposed. We briefly summarize the current state of the art in representing provenance information in RDF, and list current approaches to extending RDF for a representation of provenance information. Further details about these approaches are provided in Section 2. Two directions can be highlighted in the state of the art on data provenance: (i) approaches to represent provenance information and other information together under an integrated model (i.e., following Requirement 1 above), and (ii) the development of schemas, ontologies, and vocabularies to represent and publish specific types of provenance information (i.e., Requirements 2, 3, and 4 above).

The state of the art concerning the representation of RDF together with meta-information is RDF reification [14]. The RDF reification vocabulary is part of the current RDF recommendation and makes RDF statements identifiable. It consists of the four terms rdf:Statement, rdf:subject, rdf:predicate, and rdf:object. RDF reification can be used to represent provenance information.
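As a minimal sketch of what reification-based provenance looks like on the wire, the following Python (standard library only) emits the N-Triples for one reified statement annotated with Dublin Core provenance properties. The example.org URIs and the helper name are illustrative assumptions, not part of any cited vocabulary.

```python
# Sketch: representing provenance for a single RDF statement with RDF
# reification, emitted as N-Triples strings. Example URIs are assumptions.

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

def reify(stmt_uri, s, p, o, creator, date):
    """Return N-Triples that reify (s, p, o) and attach provenance."""
    triples = [
        (stmt_uri, RDF + "type", RDF + "Statement"),
        (stmt_uri, RDF + "subject", s),
        (stmt_uri, RDF + "predicate", p),
        (stmt_uri, RDF + "object", o),
    ]
    lines = ['<%s> <%s> <%s> .' % t for t in triples]
    # Provenance metadata about the reified statement, as literals
    lines.append('<%s> <%s> "%s" .' % (stmt_uri, DC + "creator", creator))
    lines.append('<%s> <%s> "%s" .' % (stmt_uri, DC + "date", date))
    return "\n".join(lines)

print(reify("http://example.org/stmt1",
            "http://example.org/alice",
            "http://xmlns.com/foaf/0.1/knows",
            "http://example.org/bob",
            "Serena Villata", "2011-08-08"))
```

Note that a single original triple expands into six triples here, which is precisely the verbosity and query overhead the critique below refers to.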
Nevertheless, RDF reification has rarely been picked up by the community. As pointed out by Zhao et al. [12], querying reified RDF statements with the SPARQL query language is cumbersome, so reification only partly fulfils Requirement 5 above. Various alternative approaches to RDF reification have been published, among others N3 Formulae [37], N-Quads [38], Temporal RDF [39], and Named Graphs [11].

The second area of work on data provenance is driven by the need to exchange provenance information between systems, and to publish provenance information together with primary information on the Web. Various models have already been defined to represent specific types of provenance information. These models include the Open Provenance Model [5], Dublin Core [15],[16],[17], the Semantic Web Publishing vocabulary [18],[11], the FOAF vocabulary [22], the SWAN-SIOC alignment [19], the Provenir ontology [23], the Provenance vocabulary [24], the Changeset vocabulary [20], the Data Catalog vocabulary, and RDF coloring [21]. At the end of Section 2, we present some of the issues discussed by the W3C Provenance Working Group, formed in September 2009 as part of the W3C Semantic Web Activity to provide a roadmap in the area of provenance for Semantic Web technologies, development, and possible standardization. In this context, the group has produced a number of documents, including reports on key dimensions of provenance, and use cases spanning many areas and contexts that illustrate these key dimensions.

Data Licensing

The JISC Linked Data Horizon Scan states the following about the link between Linked Data and Open Data: "Linked Data may be Open, and Open Data may be Linked, but it is equally possible for Linked Data to carry licensing or other restrictions that prevent it being considered Open, or for Open Data to be made available in ways that do not respect all of Berners-Lee's rules for Linking." [34]

The concept of open data has risen to prominence in the UK following the government's 2009/2010 data.gov.uk initiative, which aims to open access to public-sector government data and slowly turn it into Linked Open Data where appropriate. In addition to the standards relevant to Linked Data, arguably the standards most critical to Open Data are open licences. Licensing of data needs to be explicit to avoid any ambiguity about terms of use and reuse. There are many differences worldwide relating to the copyright of data. For example, in the European Union database rights exist automatically, whereas in the USA data is not covered by any existing copyright law. Guidance is available from a number of sources, including: Creative Commons licenses; the GNU Free Documentation License; Open Data Commons licenses; the Science Commons Database Protocol; and Freedom to Research: Keeping Scientific Data Open, Accessible, and Interoperable.

We briefly summarize some examples from the world of Linked Open Data.

MusicBrainz is a project that aims to create an open-content music database by capturing information about artists, their recorded works, and the relationships between them. MusicBrainz's core data is in the public domain, and additional content, including moderation data, is placed under the Open Audio License. The server software is covered by the GNU General Public License, and the MusicBrainz client software library is licensed under the GNU Lesser General Public License.

Guardian Data Store is a growing database of open data sets used by the Guardian and available to download in a variety of formats. It supports active community engagement and sharing of how data sets are being used and reused.

OpenStreetMap provides free worldwide geographical information such as maps. Currently the site uses Creative Commons licensing, in particular the Creative Commons Attribution-ShareAlike 2.0 (CC-BY-SA) license. Global community contribution is actively encouraged.

BBC Backstage is a collection of open BBC data feeds and APIs available for re-mixing and mash-ups. The site aims to encourage participation and support creativity through open innovation. There are some licence restrictions (mainly around commercial use), which are clearly defined on the site.

legislation.gov.uk collects all UK legislation, and is managed by The National Archives on behalf of HM Government. The original (as enacted) and revised versions of legislation on legislation.gov.uk are published by and under the authority of the Controller of HMSO and the Queen's Printer for Scotland. Data is released under Crown Copyright, and the user is allowed to use and re-use the information (not including logos) free of charge, in any format or medium, under the terms of the Open Government Licence. Crown Copyright covers material created by civil servants, ministers, and government departments and agencies. It is legally defined under section 163 of the Copyright, Designs and Patents Act 1988 as works made by officers or servants of the Crown in the course of their duties.

Airports is a collection of airport data from Our Airports, published as RDF.

It is a project of the Data Incubator, which is focussed on creating and publishing Linked Data, particularly where that data is converted from pre-existing sources. The ultimate goal for all datasets hosted on Data Incubator is to be (re-)adopted by their original owners. To support this, all code and documentation, including RDF vocabularies, published through Data Incubator is released as open source under liberal reuse licenses.

DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia is derived from Wikipedia and is distributed under the same licensing terms as Wikipedia itself. As Wikipedia has moved to dual licensing, DBpedia is also dual-licensed starting with release 3.4. DBpedia data from version 3.4 onwards is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license and the GNU Free Documentation License. All DBpedia releases up to and including release 3.3 are licensed under the terms of the GNU Free Documentation License only.

SW Dog Food provides metadata describing events such as conferences in the field of the Semantic Web. This metadata covers information about papers, schedules, attendees, etc. Tools can then consume this information and provide services, such as intelligent scheduling or search, to conference attendees. The event datasets on SWDF are free for everyone to use, with no restrictions or strings attached. In particular, they apply the Open Data Commons Public Domain Dedication and Licence, and the Attribution-ShareAlike Community Norms.

LIBRIS is the joint catalogue of the Swedish academic and research libraries, and is updated on a daily basis. The libraries providing cataloguing input all contribute jointly to building up the database contents. The cover pictures are protected according to the Swedish Act on Copyright for Literary and Artistic Works.
Linked Clinical Trials (LinkedCT) is a project that aims to publish the first open Semantic Web data source for clinical trials data. The data exposed by LinkedCT is generated by (1) transforming existing data sources of clinical trials into RDF, and (2) discovering links between the records in the trials data and several other data sources. Linked Clinical Trials data is licensed under a Creative Commons Attribution-Noncommercial-ShareAlike 2.5 Canada License. Permissions beyond the scope of this license are available at

In summary: MusicBrainz places its core data in the public domain, with the Open Audio License and GNU licenses for other components; OpenStreetMap, DBpedia and LinkedCT use Creative Commons licenses; DBpedia additionally uses the GNU Free Documentation License; SW Dog Food applies the Open Data Commons licenses; legislation.gov.uk and LIBRIS fall under the copyright law of their respective countries; BBC Backstage applies restrictions on commercial use; Airports is published under liberal open-source licenses; and the Guardian Data Store publishes open data sets for download.

The absence of clarity for data consumers about the terms under which they can reuse a particular dataset prevents the reuse of that data. Therefore, all Linked Data on the Web should include explicit license or waiver statements, as argued by Miller et al. [6] and Heath & Bizer [50]. In Section 3, we analyse the Creative Commons licenses [3],[30], the Open Data Commons licenses [6], the MPEG-21 Rights Expression Language [31], the Waiver vocabulary [32], and some recent developments of interest for this report on access control for the Web, and in particular for Linked Data.
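As a sketch of what such an explicit license statement can look like in practice, the following Python emits a single N-Triples line attaching a Creative Commons license to a dataset via the dcterms:license property. The dataset URI is an illustrative assumption; dcterms:license and the CC license URI are real identifiers.

```python
# Sketch: one explicit license statement for a dataset description.
# The example.org dataset URI is an assumption for illustration.

DCTERMS_LICENSE = "http://purl.org/dc/terms/license"
CC_BY_SA = "http://creativecommons.org/licenses/by-sa/3.0/"

def license_triple(dataset_uri, license_uri=CC_BY_SA):
    """Return one N-Triples line linking a dataset to its license."""
    return "<%s> <%s> <%s> ." % (dataset_uri, DCTERMS_LICENSE, license_uri)

print(license_triple("http://example.org/dataset/airports"))
```

Publishing such a triple alongside the data itself is enough for a consumer to discover the applicable terms mechanically, which is exactly the clarity argued for above.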

Organization of the Report

This report is organized as follows. Section 2 presents the different approaches to representing the provenance of information in RDF. Section 3 analyses the approaches for expressing licences, together with recent access control approaches in the field of the Semantic Web.

Data Provenance

As stated by Bizer et al. [56], Linked Data should be published alongside several types of metadata in order to increase its utility for data consumers. To enable clients to assess the quality of published data and to determine whether they want to trust it, data should be accompanied by meta-information about its creator, its creation date, and the creation method. Data sources should publish provenance metadata together with the data itself. Such metadata is expressed by means of RDF triples describing the document in which the original information is contained. There are several ways to provide basic provenance meta-information: for instance, one can adopt the FOAF vocabulary, the Dublin Core terms, or the Semantic Web Publishing vocabulary. Moreover, we can consider the Open Provenance Model, the Provenir ontology, the SWAN-SIOC alignment, the Provenance vocabulary, the Data Catalog vocabulary, the Changeset vocabulary, and RDF coloring.

In this report, we concentrate on the modelling issues concerning data provenance. However, another issue which needs to be taken into account in the context of provenance is the protocol for, and access to, provenance information in the Web of Data [51],[52]. Provenance-relevant metadata is either directly attached to a data item or its host document, or is available as additional data on the Web. Examples of attached metadata are RDF statements about the RDF graph that contains them, or the author and creation date of blog entries added to a syndication feed. Both attached and detached metadata may be represented in RDF using vocabularies, as we will describe in the remainder of this report.

Accessing data on the Web is often based on HTTP URIs. Since these URIs are grounded in the Domain Name System, it is possible to query a WHOIS [53] service in order to get provenance information about the accessed data item.
However, the responses of WHOIS services are hardly usable for automatic evaluation.

Another source of data about the content provided by a Web server is sitemaps. A sitemap is an XML document that informs search engine crawlers about URLs on a website. Semantic sitemaps [54] extend these documents with information about the location of RDF data and about alternative means to access this data (e.g., data dumps and SPARQL endpoints). If the data provider follows the Linked Data principles, a look-up of these URIs will yield RDF-based metadata that may describe provenance information about the datasets. In Linked Data, as with the URI of an RDF dataset, it is possible to look up any HTTP URI that represents a piece of provenance. Another method to discover metadata about Web resources is POWDER [43], introduced in the previous section.

A linked dataset can be provided as an RDF dump, that is, an RDF document which contains the whole linked dataset. Usually, an RDF dump represents a linked dataset as a single RDF graph. For instance, Hartig and Zhao [52] propose to augment this graph with provenance metadata about itself. In this case the added provenance metadata describes the provenance of the whole dataset and, thus, is likely to be similar to the information provided with a voiD description for the dataset. In addition, the metadata should also describe the provenance of the RDF dump itself. It is also possible to serialize a linked dataset as a collection of Named Graphs [11]. In this case, each of these graphs could contain provenance metadata about itself.

Dublin Core

The Dublin Core [16] metadata standard is a simple yet effective element set for describing a wide range of networked resources. It comprises fifteen elements, the semantics of which have been established through consensus by an international, cross-disciplinary group of professionals from librarianship, computer science, text encoding, museum and archive management, among other scholarly fields. Dublin Core has the following goals:

Simplicity of creation and maintenance: the Dublin Core element set has been kept as small and simple as possible, to allow a non-specialist to create simple descriptive records for information resources while providing for effective retrieval of those resources in the networked environment.

Commonly understood semantics: discovery of information across the vast commons of the Web is hindered by differences in terminology and descriptive practices from one field of knowledge to the next. The Dublin Core can help the user find her way by supporting a common set of elements whose semantics are universally understood and supported. For example, scientists concerned with locating articles by a particular author agree on the importance of a creator element.

International scope: although the specific linguistic challenges of the Web have not been directly addressed by the Dublin Core development community, the involvement of representatives from almost every continent has ensured that the development of the standard considers the multilingual and multicultural nature of the electronic information universe.

Extensibility: while balancing the need for simplicity in describing digital resources with the need for precise retrieval, Dublin Core developers have recognized the importance of providing a mechanism for extending the DC element set for additional discovery needs. Other communities of metadata experts will create additional metadata sets, and metadata elements from these sets can be linked with Dublin Core metadata to meet the need for extensibility. This model allows different communities to use the DC elements for core descriptive information usable across the Internet, while allowing domain-specific additions that make sense within a more limited arena.

The Dublin Core is also a widely deployed vocabulary for representing metadata about sources, in particular with the dc:creator, dc:publisher and dc:date predicates.
When the predicates dc:creator and dc:publisher are used in the context of Linked Data, the user is expected to use URIs, not literal names, to identify the creator and publisher. This allows other people to refer to them unambiguously, and to connect these URIs with background information about them available on the Web, which might be used to assess the quality and trustworthiness of published data.

In particular, the predicate dc:creator denotes the entity primarily responsible for making the resource. Examples of a creator include a person, an organization, or a service; typically, the name of the creator is used to indicate the entity. The same holds for the predicate dc:publisher, which denotes the entity responsible for making the resource available: examples of a publisher likewise include a person, an organization, or a service, and typically the name of the publisher is used to indicate the entity. Moreover, the elements Audience, Provenance and RightsHolder are not defined as part of the fifteen Simple Dublin Core elements, but are part of the 12 elements defined in Qualified Dublin Core [25]. The details of these three elements are as follows:

Audience: a class of entity for whom the resource is intended or useful. A class of entity may be determined by the creator, the publisher, or a third party. The guidelines for content creation underline that audience terms are best utilized in the context of formal or informal controlled vocabularies. Some examples are Audience="elementary school students", Audience="ESL teachers", and Audience="deaf adults".

Provenance: a statement of any changes in ownership and custody of the resource since its creation that are significant for its authenticity, integrity and interpretation. The statement may include a description of any changes successive information providers made to the resource. Some examples are Provenance="This copy once owned by Benjamin Spock", Provenance="Estate of Hunter Thompson", and Provenance="Stolen in 1999; recovered by the Museum in 2003".

RightsHolder: a person or organization owning or managing rights over the resource. Recommended best practice is to use the URI or name of the rights holder to indicate the entity.
The guidelines for content creation underline that, since for the most part people and organizations are not assigned URIs, a person or organization holding rights over a resource would be named using a text string. Some examples are RightsHolder="Stuart Weibel" and RightsHolder="University of Bath".
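A sketch of these qualified elements in Turtle (hypothetical resource URI; literal values are used here for brevity, following the examples above):

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .

# Hypothetical resource; the values repeat the examples given above.
<http://example.org/book/42>
    dcterms:audience     "elementary school students" ;
    dcterms:provenance   "This copy once owned by Benjamin Spock" ;
    dcterms:rightsHolder "University of Bath" .
```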

Semantic Web Publishing

The Semantic Web Publishing vocabulary (SWP) [26] is an RDF-Schema vocabulary for expressing information-provision-related meta-information and for assuring the origin of information with digital signatures. The Semantic Web Publishing vocabulary is designed for information syndication processes in which information is passed through multiple intermediaries. These syndication processes imply three basic roles:

Information Providers publish information in various forms. They have different degrees of commitment towards published information, e.g. they might believe information to be true or might be in doubt about the reliability of published information. In order to prove the origin of information and to ensure that information is not altered in the syndication process, information providers can digitally sign information.

Information Syndicators are intermediaries who collect information from multiple information providers and distribute collected information to information consumers or other syndicators. Information syndicators might add meta-information about the syndication process to syndicated information. They are not committed to the truth of information, as they are merely quoting other sources.

Information Consumers receive information directly from information providers or through information syndicators. For assessing the quality of received information, information consumers require meta-information about the origin of information and the syndication process. In order to verify the origin of information, information consumers might require information to be digitally signed.

Figure 1 gives an overview of the Semantic Web Publishing vocabulary.
The vocabulary consists of two parts: the first part defines terms for authorizing information and for representing information-provision-related meta-information, and the second part defines terms for representing digital signatures.

The basic idea of the SWP vocabulary is to record the authorizing relationship between a named graph [11] and an authority in the form of a warrant. An authorizing relationship means that the authority in some sense commits itself to the content of the graph. The SWP vocabulary provides terms for representing different propositional attitudes towards a graph, such as asserting or quoting. Warrants may also record other properties of an authorizing relationship, such as the validity or the expiry date.

Figure 1 - The Semantic Web Publishing vocabulary.

Figure 2 gives an overview of the SWP terms for authorizing graphs. The swp:authority property relates warrants to authorities. The swp:assertedBy and swp:quotedBy properties capture the propositional attitude of the relationship between a graph and a warrant. These take a named graph as subject and a warrant as object; swp:authority takes a warrant as subject and an authority as object. Each warrant must have a unique authority. Roughly, swp:assertedBy means that the warrant records an endorsement or assertion that the graph is true, while swp:quotedBy means that the graph is being presented without any comment being made on its truth.
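As a sketch in TriG-style syntax (hypothetical URIs; the swp: namespace URI is assumed here to be the one published with the vocabulary), a graph asserted by a warrant might look as follows:

```turtle
@prefix swp: <http://www.w3.org/2004/03/trix/swp-2/> .
@prefix ex:  <http://example.org/> .

# A named graph (TriG syntax)...
ex:G1 { ex:sky ex:hasColor ex:blue . }

# ...asserted (not merely quoted) by a warrant whose authority is ex:acme.
ex:G1 swp:assertedBy ex:warrant1 .
ex:warrant1 swp:authority ex:acme .
```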

Figure 2 - SWP terms for authorizations.

As underlined above, in order to prove the origin of information and to ensure that information is not altered in the syndication process, information providers may decide to digitally sign named graphs. These named graphs are typed as instances of the class rdfg:Graph. Computing a digital signature for a large amount of data is usually expensive; therefore, it is common practice to calculate a digest of the data. Carroll [40] proposes an algorithm for transforming semantically equivalent graphs into a canonical serialization. The algorithm renames blank nodes in a uniform fashion and sorts triples into a lexical order. After canonicalizing a graph, its digest can be computed from the canonical serialization using a standard hash function such as SHA1 or MD5. The Semantic Web Publishing vocabulary provides terms for representing digital signatures, for indicating the signature method that was used to compute a signature, and for representing cryptographic keys and certificates. The signature-related terms of the SWP vocabulary are shown in Figure 3. Graph signatures are attached to warrants using the swp:signature property. The value of the swp:signature property is an RDF literal representing the signature of the graph that is asserted or quoted by the warrant. The swp:signatureMethod property identifies the signature

method that was used to calculate the signature.

Figure 3 - Signature-related terms of SWP.

The difference between the Dublin Core vocabulary and the Semantic Web Publishing vocabulary lies in the types of representable relations: SWP focuses on the commitment of an authority towards the truth of information, and asserting a graph implies a claim by the authority that the content of the graph is true. In contrast, the Dublin Core terms focus on the role of a person or institution in the process of creating an information resource. A second difference lies in the way both vocabularies are used within RDF. The Dublin Core elements are used as predicates of RDF triples describing a resource, for instance <resource> dc:creator "Name". The SWP vocabulary captures a relationship between an authority and an information resource using warrants as an additional level of indirection. This reification of the relationship allows the relationship to be described using additional properties, such as validity and expiry date. Moreover, another difference is that the SWP vocabulary applies to graphs while the Dublin Core usually applies to concrete resources.
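Returning to the signature terms, a signed warrant might be sketched as follows (hypothetical URIs; the signature literal and the signature-method URI are placeholders, not actual SWP method identifiers):

```turtle
@prefix swp: <http://www.w3.org/2004/03/trix/swp-2/> .
@prefix ex:  <http://example.org/> .

ex:warrant1 swp:authority ex:acme ;
    # Placeholder literal standing for the base64-encoded graph signature.
    swp:signature       "MC0CFQCf..." ;
    # Hypothetical method URI: canonicalize (Carroll), digest, then sign.
    swp:signatureMethod ex:c14n-sha1-rsa .
```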

Open Provenance Model

The Open Provenance Model vocabulary (OPMV) [10],[27] is a lightweight provenance vocabulary aiming to provide terms that enable practitioners of data publishing to publish their data. The Open Provenance Model vocabulary is closely based on the community provenance data model, the Open Provenance Model (OPM) [5],[10]. OPMV can be used together with other provenance-related RDF/OWL vocabularies and ontologies, such as Dublin Core, FOAF, the Changeset vocabulary, and the Provenance Vocabulary. Being grounded in OPM, the OPMV aims to assist the interoperability of provenance information on the Semantic Web. The Open Provenance Model Vocabulary is defined as an OWL-DL ontology and is partitioned into a core ontology and supplementary modules. In order to avoid making the core ontology too complex, the core module only implements structures defined in OPM, and the supplementary modules provide less frequently used terms and a broad range of specializations of the core terms. The Open Provenance Model is a model of provenance that is designed to meet the following requirements:

1. To allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model.
2. To allow developers to build and share tools that operate on such a provenance model.
3. To define provenance in a technology-agnostic manner.
4. To support a digital representation of provenance for any thing, whether produced by computer systems or not.
5. To allow multiple levels of description to coexist.
6. To define a core set of rules that identify the valid inferences that can be made on provenance representation.

In particular, the OPM vocabulary defines three classes:

Agent: a contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, or affecting its execution.

Artifact: a general concept that represents an immutable piece of state, which may have a physical embodiment in a physical object, or a digital representation in a computer system.

Process: an action or series of actions performed on or caused by artifacts, and resulting in new artifacts.

Figure 4 visualizes the classes and properties of the OPM vocabulary.

Figure 4 - The Open Provenance Model vocabulary.

The Open Provenance Model allows us to characterize what caused things to be, i.e., how things depended on others and resulted in specific states. In essence, it consists of a directed graph expressing such dependencies. A primary concern is to be able to represent how things came to be in a given state, with a given set of characteristics, at a given moment. It is recognized that many such things can be stateful, e.g., a file can contain different data at different moments of its existence. Hence, from the perspective of provenance, OPM introduces the concept of an artifact as an

immutable piece of state; likewise, it introduces the concept of a process as actions resulting in new artifacts. A process usually takes place in some context, which enables or facilitates its execution: examples of such contexts are varied and include a place where the process executes, an individual controlling the process, or an institution sponsoring the process. These entities are referred to as agents; agents are a cause of a process taking place.

The FOAF Vocabulary

FOAF [22] is a linked information system. It has been designed to allow for integration of data across a variety of applications, Web sites and services. It provides a basic vocabulary of terms for talking about people and the things they make and do. The FOAF project is based around the use of machine-readable Web homepages for people, groups, companies and other kinds of thing. To achieve this, the FOAF vocabulary provides a collection of basic terms that can be used in these Web pages. At the heart of the FOAF project is a set of definitions designed to serve as a dictionary of terms that can be used to express claims about the world. The initial focus of FOAF has been on the description of people, since people are the things that link together most of the other kinds of things described in the Web: they make documents, attend meetings, are depicted in photos, and so on. The FOAF vocabulary definitions are written using RDF/OWL, which makes it easy for software to process some basic facts about the terms in the FOAF vocabulary, and consequently about the things described in FOAF documents. A FOAF document can be combined with other FOAF documents to create a unified database of information.
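For instance (hypothetical URIs), a small FOAF description of a person and of a document she made might read:

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/> .

ex:alice a foaf:Person ;
    foaf:name     "Alice Smith" ;
    foaf:mbox     <mailto:alice@example.org> ;
    foaf:homepage <http://example.org/alice.html> .

# foaf:maker links a thing to the agent that made it.
ex:report a foaf:Document ;
    foaf:maker ex:alice .
```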
(Note on the notion of artifact as an immutable piece of state: the W3C Provenance Working Group is introducing the notion of an Invariant View or Perspective on a Thing (IVP), a relation between two things A and B. B is an IVP of A if A and B represent the same thing in the real world and the thing states modelled by A and B are consistent. For more details about IVP, see the Working Group wiki.)

FOAF is a Linked Data system, in

that it is based around the idea of linking together a Web of decentralised descriptions. The idea is the following: if people publish information in the FOAF document format, machines will be able to make use of that information. If those files contain "see also" references to other such documents in the Web, the result is a machine-friendly version of today's hypertext Web. FOAF files are text documents, written in XML syntax, and adopt the conventions of RDF. The FOAF vocabulary defines constructs that appear in FOAF files, alongside other RDF vocabularies defined elsewhere. Among others, FOAF defines classes such as foaf:Person, foaf:Document and foaf:Image, alongside some properties of those things, such as foaf:name, foaf:mbox (i.e., an e-mail box) and foaf:homepage, as well as some kinds of relationship that hold between members of these categories. An extension of this set of relationships may be found in the RELATIONSHIP ontology, which is a vocabulary for describing relationships between people. For instance, a relationship type is foaf:depiction, which relates e.g. a foaf:Person to a foaf:Image. Figure 5 shows some of the terms in the FOAF vocabulary.

Figure 5 - Some of the FOAF terms grouped in categories.

The FOAF property foaf:maker is relevant for data provenance since the

maker is the agent that made the thing. This vocabulary was not defined with the precise intent of representing the provenance of information, and only the maker property in the vocabulary serves the aim of representing provenance.

SWAN-SIOC Alignment

The SWAN-SIOC alignment [19] provides a way to represent the provenance of information in the field of scientific discourse. The SWAN (Semantic Web Applications in Neuromedicine) project attempts to model scientific discourse about Alzheimer's disease. The SIOC (Semantically-Interlinked Online Communities) ontology can be used in the Health Care and Life Sciences context as a means to represent the various interactions happening in these communities and their related content. SWAN and SIOC act in a complementary way: SWAN provides fine-grained modelling of scientific discourse elements while SIOC can represent more generic contributions in online communities. The alignment between the SWAN and SIOC ontologies provides a complete model to represent scientific discourse in online communities at different levels of granularity (discourse elements and content items). In order to define the alignments between the SWAN and SIOC ontologies, a special module for SIOC has been created (the SWAN/SIOC module). This module contains the different statements defining the alignments (subclasses, sub-properties) between the original ontologies in a machine-readable form, i.e. RDF. Figure 6 identifies the various relationships defined in this module. For further details about the classes and the properties defined by SWAN/SIOC, see [19].

Figure 6 - The SWAN-SIOC alignment.

The Provenir Ontology

The Provenir ontology [41], named from the French word provenir, is a common provenance model, which forms the core component of a modular approach to provenance management in eScience. Domain-specific details are an important component of provenance representation, but a single provenance ontology that models all possible details from different domains may not be feasible. The Provenir approach is a modular approach which involves the integrated use of multiple ontologies, each modelling provenance metadata specific to a particular domain. These multiple ontologies use the Provenir ontology as the common reference model, hence simplifying their interoperability. This modular framework provides a scalable approach to provenance modelling that can be adapted to the specific requirements of different domains. Figure 7 visualizes the classes and the

properties of the Provenir ontology. To represent provenance metadata classes, Provenir uses the two primitive concepts of occurrent and continuant from philosophy. Continuants are defined as entities which endure, or continue to exist, through time while undergoing different sorts of changes, including changes of place. Occurrents are defined as entities that unfold themselves in successive temporal phases. Three base classes are defined in the Provenir ontology representing the primary components of provenance: data, agent and process. The two base classes data and agent are defined as specializations (sub-classes) of the continuant class. The third base class, process, is a synonym of occurrent.

Figure 7 - The Provenir ontology.

The definition of each class is as follows:

data: this class models continuant entities that represent the starting

material, intermediate material, end products of a scientific experiment, and parameters that affect the execution of a scientific process. Data inherit the properties of continuants, such as enduring or existing while undergoing changes.

process: this class models the occurrent entities that affect (process, modify, create, or delete, among other dynamic activities) individuals of data.

agent: this class models the continuant entities that causally affect the individuals of process.

In addition to these three base classes, five sub-classes of data are defined: data_collection, parameter, temporal_parameter, spatial_parameter, and domain_parameter. We do not analyse these subclasses in detail because they are out of the scope of this report. A set of foundational properties is defined in the Provenir ontology, adapting the properties defined in the Relation ontology from the Open Biomedical Ontologies Foundry. Some of them are listed below:

part_of: this property is defined for each of the three base classes of the Provenir ontology. The restriction for this property is that the domain and range values belong to the same class.

contained_in: this property is defined for the data and agent classes.

has_participant: this is the primary property linking data to process, where the individual of the data class participates in a process.

has_agent: this is a causal property that links agent to process, where the agent is directly responsible for the change in state of the process. The Provenir ontology also allows the use of this property to capture the directionality of scientific experiments, for example which agent caused the activation of a process.
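A minimal sketch of these properties in Turtle (hypothetical experiment URIs; the provenir: namespace URI and the direction of the properties are assumptions based on the descriptions above):

```turtle
@prefix provenir: <http://knoesis.wright.edu/provenir/provenir.owl#> .
@prefix ex:       <http://example.org/> .

# Hypothetical experiment: a process with a data participant and an agent.
ex:sample42  a provenir:data .
ex:sequencer a provenir:agent .
ex:run7      a provenir:process ;
    provenir:has_participant ex:sample42 ;   # data taking part in the process
    provenir:has_agent       ex:sequencer .  # agent causally affecting it
```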

The Provenance Vocabulary

The Provenance Vocabulary [24] is defined as an OWL ontology, and it is partitioned into a core ontology and supplementary modules. To avoid making the core ontology too complex, the modules provide less frequently used concepts and a broad range of specializations of the core concepts. Three supplementary modules are provided: Types, Files and Integrity Verification. The development of this vocabulary is motivated by the need to describe the main aspects of the provenance of data consumed from the Web. Two main dimensions of provenance have been identified: data creation and data access. More general concepts, such as actors, processes, and artifacts, are relevant in both these dimensions. Consequently, the Provenance Vocabulary consists of three parts: general terms, terms for data creation, and terms for data access. Figure 8 provides an overview of the Provenance Vocabulary.

Figure 8 - The Provenance vocabulary.

The general terms include classes for the general types of provenance elements: Actor, Execution and Artifact. Actor has the sub-classes HumanActor and NonHumanActor; Artifact has the sub-classes DataItem and File.
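As a hedged sketch (hypothetical URIs; the prv: namespace URI is assumed to be the one published with the vocabulary), the data access terms discussed below might be used as follows:

```turtle
@prefix prv: <http://purl.org/net/provenance/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

# A data item retrieved from the Web by a data access execution.
ex:g1 a prv:DataItem ;
    prv:retrievedBy [
        a prv:DataAccess ;
        prv:performedAt      "2011-08-08T10:00:00Z"^^xsd:dateTime ;
        prv:accessedResource ex:rome ;
        prv:accessedService  ex:sparqlEndpoint
    ] .
```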

Furthermore, the general terms include properties that relate individuals of the general classes with each other: an Artifact was yieldedBy an Execution, which may have used further employedArtifacts. An Execution was performedAt a specific time, it was performedBy an Actor, and it might have had other involvedActors. An Artifact might have been serializedBy a file. The data access dimension focuses on retrieving data items from the Web. Using the data access terms in provenance descriptions is, in particular, recommended to provide information about the retrieval of source data items and of creation guidelines. The Provenance Vocabulary makes it possible to describe how a DataItem has been retrievedBy the execution of a DataAccess. The retrieved DataItem is a Web representation of the accessedResource. The accessedService is a DataProvidingService which was usedBy the DataPublisher; furthermore, each DataProvidingService is usually operatedBy a HumanActor. To allow for a wide range of applications, the vocabulary does not prescribe a specific granularity by which provenance information has to be described. Hence, the classes in the core ontology are quite general. For instance, a DataItem could be a single RDF triple or a specific RDF graph that represents a Linked Data object. More specific specializations of the general classes are provided with the Types module. However, while this vocabulary, including its modules, provides a basic framework to describe the provenance of data from the Web, it does not aim to support the description of every aspect and detail of provenance.

The Changeset Vocabulary

The Changeset vocabulary [20] defines a set of terms for describing changes to resource descriptions. The vocabulary introduces the notion of a

ChangeSet, which encapsulates the delta between two versions of a resource description. In this context a resource description is the set of triples that in some way comprise a description of a resource. The delta is represented by two sets of triples: additions and removals. A ChangeSet can be used to modify a resource description by first removing from the description all triples that are in the removals set and then adding the triples in the additions set. In particular, the Changeset vocabulary defines the property creatorName, representing the name of the entity responsible for creating the changeset. Having this property implies being a ChangeSet, and every value of this property is an rdfs:Literal. This property makes it possible to identify the provenance of a ChangeSet. The vocabulary also defines the property precedingChangeSet, representing the changeset that immediately precedes this one. This property can be used to build a history of changes to a particular resource description. The first ChangeSet in the history has no precedingChangeSet property; each subsequent ChangeSet added to the history references the preceding one, resulting in a singly-linked list of changes. This property can be used to obtain the full provenance information of a resource.
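An illustrative ChangeSet in Turtle (hypothetical URIs; only creatorName and precedingChangeSet are quoted above, the other property names and the reified-statement form of additions and removals are assumptions):

```turtle
@prefix cs:  <http://purl.org/vocab/changeset/schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc:  <http://purl.org/dc/elements/1.1/> .
@prefix ex:  <http://example.org/> .

ex:change2 a cs:ChangeSet ;
    cs:subjectOfChange    ex:doc1 ;
    cs:creatorName        "Alice Smith" ;
    cs:precedingChangeSet ex:change1 ;   # links into the change history
    cs:removal  [ a rdf:Statement ; rdf:subject ex:doc1 ;
                  rdf:predicate dc:title ; rdf:object "Old title" ] ;
    cs:addition [ a rdf:Statement ; rdf:subject ex:doc1 ;
                  rdf:predicate dc:title ; rdf:object "New title" ] .
```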

Data Catalog Vocabulary

The Data Catalog Vocabulary, which builds on the DCAT vocabulary visualized in Figure 9, is an RDF vocabulary for the exchange of data catalogues. Its primary purpose is the expression of government data catalogues, such as data.gov or data.gov.uk, in RDF. It is being produced by the W3C eGovernment Interest Group.

Figure 9 - The DCAT vocabulary.

In particular, this vocabulary introduces, among others, the following classes:

dcat:Catalog: a data catalog is a curated collection of metadata about datasets.

dcat:Dataset: a collection of data, published or curated by a single source, and available for access or download in one or more formats.

Concerning data provenance and licensing, the vocabulary introduces the following properties, which are useful for determining the provenance of data in the Web of Data and the license under which data is released:

dct:publisher: the entity responsible for making the catalogue or the dataset available.

dcat:licence: this property describes the license under which the catalogue, not its datasets, can be used or reused. Even if the license of the catalogue applies to all of its datasets, it should be replicated on each dataset.

dct:license: this property describes the license under which the dataset is published and can be reused.

In this vocabulary, there is a clear separation between the metadata and provenance information related to a dataset, the catalogue record, and the catalogue itself, which is also depicted as a unique element in the ontology.

RDF Coloring

Flouris et al. [21] propose the use of colors in order to capture the provenance of RDF data and schema triples. The idea is that the color of an implicit or explicit triple represents the source from which the triple was obtained. They record the color of an RDF triple as a fourth column, hence obtaining an RDF quadruple. Figure 10 provides the intuition of the granularity levels of provenance for RDF captured by this work.
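Using the named graphs syntax, the coloring idea can be sketched as follows (TriG-style, hypothetical data): each named graph groups the triples sharing one color, i.e. coming from one source.

```turtle
@prefix ex: <http://example.org/> .

# Each named graph groups the triples obtained from one source ("color").
ex:sourceA { ex:rome a ex:City .
             ex:rome ex:population "2761477" . }
ex:sourceB { ex:rome ex:population "2870000" . }
```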

Figure 10 - An example of RDF coloring.

The authors compare their approach with the representation of provenance information in the relational context; e.g., [29] use colors to capture the provenance of relational tables, tuples and attributes. Consider a relational tuple of the form [a1:v1, ..., ak:vk] with tuple identifier tid; it corresponds to a set of triples (tid, aj, vj), j = 1, ..., k, as shown in Figure 10. A color assigned to (a) a single triple captures provenance at the level of an attribute of the relational tuple; (b) a collection of triples sharing the same subject captures provenance at the level of the relational tuple; and finally (c) a set of triples whose subjects are instances of the same schema class captures provenance of the relational table. The quadruples used to represent colored RDF/S triples leverage the syntax of RDF named graphs [11]: an RDF named graph is modeled by arbitrary sets of triples sharing the same color. Flouris et al. [21] use a logical representation of quadruples to store the color of an RDF triple. The use of colors allows them to capture provenance at several granularity levels. The authors provide an extension of RDFS inference rules to determine the provenance of implicit RDF triples.

Provenance Working Group

A comparison of different provenance vocabularies, as well as further resources regarding the publication of provenance information, are available

on the website of the W3C Provenance Working Group. In the following we summarize the main ideas that have been discussed by this Working Group. The need to represent provenance information has led to the creation of a number of provenance models that cater to a diverse set of application domains. The provenance models often reflect the requirements of the user community that developed the model. For example, the Open Provenance Model (OPM) was developed as a generic provenance model in the context of workflow provenance, whereas the Provenir ontology was developed as an upper-level provenance ontology for use in Semantic Web applications, and the Provenance Vocabulary was proposed for publishing provenance-aware content using the Linked Data principles. Given the large number of such models, it is useful to identify the correspondences between their respective provenance terms: these help users better understand the similarities and differences between the provenance terminologies, facilitate the development of applications that can utilize the mappings for provenance interoperability, and enable the provenance research community to move towards the adoption of a common provenance terminology. As core provenance terms, the Working Group has selected the provenance-related terms in OPM. The first result of this comparison is that many of the considered models and vocabularies have a set of core concepts that correspond to the notions of processes, artifacts, and agents as defined in OPM. These concepts can be mapped quite naturally between the models. While the modelling of these concepts indicates a process-centric view, several vocabularies take a resource-centric view. The group has experienced difficulties with vocabularies that use relationships between entities as shortcuts for a detailed process description.
A provenance description could introduce a process that represents the contribution process, but it would make things more verbose than necessary. Hence, resource-centric terms are important shortcuts to complement process-centric provenance vocabularies, i.e., they allow for a more compact representation of provenance.

Several vocabularies provide non-causal relationships, something explicitly left out of OPM. For instance, Provenir includes the property provenir:preceded_by to represent temporal order, or provenir:adjacent_to for spatial proximity. The Provenance Vocabulary allows users to describe who was responsible for a data providing service that was accessed during the execution of a data access process. While it can be argued whether such relationships are provenance-related or not, they may be of great value in several application areas of provenance descriptions, such as provenance-based measurement of the trustworthiness of content or information quality assessment in general. However, the definition of non-causal relationships in provenance vocabularies should ensure that no conflict with causality can result. Concerning time, OPM does not represent time-related properties explicitly as one of the terms defined for node types and their relationships; it enables the specification of time constraints using timestamps attached to relationships. Further aspects of provenance that are not well captured by OPM and, thus, missing from the core provenance terms are: versioning, a notion of artifact identity persisting across transformations, containment relationships and collections, and cryptographic hashes and digital signatures. Some of the considered vocabularies introduce rich sets of useful concepts for these aspects. In many cases these concepts can be seen as sub-types of OPM terms. To preserve their rich expressiveness, a systematic structuring of these concepts, per application domain, in the form of OPM profiles would be necessary. Similarly, better bridges between OPM and vocabularies that are already standardized and strongly adopted (e.g., Dublin Core), but do not have the full expressiveness of OPM, would be desirable.


Data Licensing

Bizer et al. [56] observe about licensing in the Web of Data that the applications that consume data from the Web must be able to access explicit specifications of the terms under which data can be reused and republished. The availability of appropriate frameworks for publishing such specifications is an essential requirement in encouraging data owners to participate in the Web of Data, and in providing assurances to data consumers that they are not infringing the rights of others by using data in a certain way. Initiatives such as Creative Commons have provided a framework for the open licensing of creative works, underpinned by the notion of copyright. However, Miller et al. [6] underline that copyright is not applicable to all data, because not all data are creative works. Therefore frameworks such as the Open Data Commons licences should be adopted to state the reuse conditions. An analysis of data licensing in France has been provided by the Fing in a recent report. They selected eight legal cases. The default position is that, if the public actor chooses to be compliant with the law, it is not necessary to specify which law, although doing so will make life easier for re-users. Note that the decision to impose restrictive conditions of reuse requires the production of a license. Other legal frameworks set out the conditions for reuse:

The Terms of the APIE (October 2010), a pedagogic reformulation of the framework of the 1978 law.

The APIE license types, which are frames to assert that data reuse is subject to special conditions (e.g., updates) and/or the payment of a fee.

The ODbL license (also known as ODC-ODbL, 2010), an

international license issued by the Open Data Commons organization.

The ODC-By license (2010), also produced by the Open Data Commons organization, but without the share-alike restriction of the ODbL.

The PDDL license (2010), also produced by the Open Data Commons organization, like the ODbL.

Creative Commons licenses (since 2004), a group of licenses and definitions enabling an author to facilitate the reuse of his work, in the field of the legislation on copyright. They can be applied to the field of literary and artistic works, but these licenses seem less suited to the field of databases.

The IP license from the Department of Justice (May 2010), one of the first licenses in France inspired by the Creative Commons licenses, but suited to databases and, in particular, public data.

These different licenses have common features, but also differ from each other. The requirement to mention the author (BY) seems to be one of the best-shared features, since it is absent only in the PDDL license, which in essence exempts re-users from any obligation. Most legal frameworks allow commercial use: that is, they make it possible for re-users to sell public data without transforming or enriching them. The IP license is an exception and prohibits the reuse of its data as is for commercial purposes: to make a profitable commercial reuse, the data must be enriched in some way. Some features are adopted by some legal frameworks and not by others. These features establish a kind of split on some of the major issues in data reuse. The French law of 1978 requires re-users, unless explicit consent is given, to indicate the data sources and the updates of the data, and to maintain the integrity of the data. Some licenses facilitate the removal of one or another of these

requirements. In particular, allowing the modification of the data is an explicit feature of the Open Data Commons licenses. This feature is well suited to data that come from multiple sources and whose development requires adjustments. It simplifies the reuse of data in the sense that a re-user who changes the data does not have to ask the author for permission. This applies, for example, to data from OpenStreetMap: the map is constantly enriched and modified by tens of thousands of contributors; it is simple and fast to allow each contributor to enrich the map in his spare time. Some licenses also allow the removal of one or another of these obligations; e.g., the requirement to mention the date of the last update of the data is not provided for in the licenses of the Open Knowledge Foundation. Other features also differentiate certain licenses. The ban on the commercial use of data is a feature of some Creative Commons licenses (such as CC-NC), the only ones that adopt it. The data covered by this obligation cannot be reused in a commercial setting. In this section, we analyse the Creative Commons licenses [3],[30], the Open Data Commons licences [6], the MPEG-21 Rights Expression Language [31], the Waiver vocabulary [32], and some recent developments of interest for this report on access control for the Web, and in particular for Linked Data.

Creative Commons

The Creative Commons Rights Expression Language (ccREL) [3],[30] is the standard recommended by Creative Commons (CC) for the machine-readable expression of copyright licensing terms and related information. ccREL is based on RDF. Creative Commons was publicly launched in December 2002, but its genesis traces back to the summer of 2000 and to discussions about how to promote a reasonable and flexible copyright regime for the Web. There was no standard legal means for creators to grant limited rights to the public for online material, and obtaining rights often required difficult searches to identify rightsholders and transaction costs to negotiate permissions.

A core issue is the ability for machines to detect and interpret the licensing terms in an automatic way. As stated in [3], equally important is constructing a robust user-machine bridge for publishing and detecting structured licensing information on the Web, and stimulating the emergence of tools that lower the barriers to collaboration and remixing. An example of the use of the licenses presented in [3], visualized in Figure 11, is the following. Consider Lawrence Lessig's blog, a document identified by its URL, which is licensed under the Creative Commons Attribution license. That license is also a document, identified by its own URL. The property of being licensed under, which is called license, can itself be considered a Web object and identified by a URL; this URL refers to a Web page that contains information describing the license property. The RDF triple that describes the license for Lessig's blog could be represented graphically as a point (the subject) labelled with the blog URL, a second point (the value) labelled with the license URL, and an arrow (the property) labelled with the URL that describes the meaning of the term license, running from the blog to the license.

Figure 11 - An RDF triple with a CC license.

As an abstract specification, ccREL consists of a small but extensible set of RDF properties that should be provided with each licensed object. The abstract model for ccREL distinguishes two categories of properties: 1. Work properties describe aspects of specific works, including under which license a Work is distributed, and 2. License properties describe aspects of licenses.

Publishers are normally concerned only with Work properties: this is the only information publishers provide to describe a Work's licensing terms. License properties are used by Creative Commons itself to define the authoritative specifications of the licenses they offer. Other organizations are free to use these components for describing their own licenses. Such licenses, although related to Creative Commons licenses, would not themselves be Creative Commons licenses, nor would they necessarily be endorsed by Creative Commons. The following License properties are defined as part of ccREL:
cc:permits: permits a particular use of the Work above and beyond what default copyright law allows,
cc:prohibits: prohibits a particular use of the Work, specifically affecting the scope of the permissions provided by cc:permits (but not reducing the rights granted under copyright),
cc:requires: requires certain actions of the user when enjoying the permissions given by cc:permits,
cc:jurisdiction: associates the license with a particular legal jurisdiction,
cc:deprecatedOn: indicates that the license has been deprecated on the given date,
cc:legalcode: references the corresponding legal text of the license.
The possible values for cc:permits, i.e., the possible permissions granted by a CC license, are:
cc:Reproduction: copying the work in various forms,
cc:Distribution: redistributing the work,
cc:DerivativeWorks: preparing derivatives of the work.

The possible values for cc:prohibits, i.e., the possible prohibitions that modulate permissions but do not affect the permissions granted by copyright law, are:
cc:CommercialUse: using the work for commercial purposes.
The possible values for cc:requires are:
cc:Notice: providing an indication of the license that governs the work,
cc:Attribution: giving credit to the appropriate creator,
cc:ShareAlike: when redistributing derivative works of this work, using the same license,
cc:SourceCode: when redistributing this work (which is expected to be software when this requirement is used), source code must be provided.
Creative Commons also encourages publishers to include additional triples giving information about licensed Works, e.g., the title, the name and URL for assigning attribution, and the document type, as shown in the example of Figure 12.

Figure 12 - An example of CC RDF annotation.

ccREL is meant to be independent of any particular syntax for expressing RDF triples. Creative Commons does not allow third parties to modify these properties for existing Creative Commons licenses. Publishers may use these properties to create new licenses of their own.
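To make the abstract model concrete, the sketch below models a ccREL-style license as plain Python sets of permissions, prohibitions and requirements, together with a check of whether a proposed action is permitted. This is only a toy illustration, not an official Creative Commons tool; the term names follow the ccREL values listed above, and the two license bundles are our own simplified renderings of CC BY-SA and CC BY-NC.

```python
# Toy model of ccREL terms: a license bundles permits/prohibits/requires sets.
CC_BY_SA = {
    "permits": {"cc:Reproduction", "cc:Distribution", "cc:DerivativeWorks"},
    "prohibits": set(),
    "requires": {"cc:Notice", "cc:Attribution", "cc:ShareAlike"},
}
CC_BY_NC = {
    "permits": {"cc:Reproduction", "cc:Distribution", "cc:DerivativeWorks"},
    "prohibits": {"cc:CommercialUse"},
    "requires": {"cc:Notice", "cc:Attribution"},
}

def allowed(license_terms, action):
    """An action is allowed if it is explicitly permitted and not prohibited."""
    return (action in license_terms["permits"]
            and action not in license_terms["prohibits"])

print(allowed(CC_BY_SA, "cc:Distribution"))   # True
print(allowed(CC_BY_NC, "cc:CommercialUse"))  # False: prohibited
```

An application consuming licensed data could run such a check before redistributing a work, and consult the requires set to know which obligations (attribution, notice, share-alike) it must then honour.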

Open Data Commons

Miller et al. [6] note that discussions of opening access to resources on the Web often turn to the activities of Creative Commons. They underline that the legal protections upon which Creative Commons and other similar licenses rely depend upon national and international legislation around copyright. Copyright protection applies to acts of creativity, i.e., creative works, and categorically extends neither to databases nor to the non-creative parts of their content. Despite numerous cases in which individuals or organizations release data onto the Web and apply a Creative Commons or similar license to it, often there is no defensible legal basis. Miller et al. [6] suggest that if the aim is to release large quantities of data onto the Web with the explicit intention that they be used and reused, then a different solution is required. Miller et al. [6], [1] propose the Open Data Commons licenses 34. The Public Domain Dedication and Licence (PDDL) 35 allows anyone to freely share, modify, and use the work for any purpose and without any restrictions. This license is intended for use on databases or their contents ("data"), either together or individually. Many databases are covered by copyright. These sets of rights, as well as other legal rights used to protect databases and data, can create uncertainty or practical difficulty for those wishing to share databases and their underlying data but retain a limited amount of rights under a "some rights reserved" approach to licensing, as outlined in the Science Commons Protocol for Implementing Open Access Data [33]. The Attribution License (ODC-By) 36 has been developed for databases, and it allows users to copy, distribute, and use the database, to produce works from the database, and to modify, transform and build upon the database.
Under the Open Database Licence (ODbL) 37, any public use of the database, or of works produced from the database, must be attributed in the manner specified in the ODbL. For any use or redistribution of the database, or of works produced from it, the license of the database has to be made clear, and any notices on the original database must be kept intact. If any adapted

version of this database is publicly used, the adapted database has to be offered under the ODbL. As a result, the Open Data Commons waivers and licences try to eliminate or fully license any rights that cover a database and its data. Any Community Norms or similar statements of use of the database or data do not form a part of the document, and do not act as a contract for access or other terms of use for the database or data.

MPEG-21 REL

In MPEG-21, the Rights Expression Language (REL) [31] is a machine-readable language that can declare rights and permissions using the terms defined in the Rights Data Dictionary. The REL is intended to provide flexible, interoperable mechanisms to support the transparent and augmented use of digital resources in publishing, distributing, and consuming digital movies, digital music, electronic books, broadcasting, interactive games, computer software and other creations in digital form, in a way that protects digital content and honours the rights, conditions, and fees specified for digital contents. It is also intended to support the specification of access and use controls for digital content in cases where financial exchange is not part of the terms of use, and to support the exchange of sensitive or private digital content. The Rights Expression Language is also intended to provide a flexible, interoperable mechanism to ensure that personal data are processed in accordance with individual rights, and to meet the requirement for users to be able to express their rights and interests in a way that addresses issues of privacy and the use of personal data. A standard Rights Expression Language should be able to support guaranteed end-to-end interoperability, consistency and reliability between different systems and services.
To do so, it offers richness and extensibility in declaring rights, conditions and obligations, ease and persistence in identifying and associating these with digital contents, and flexibility in

supporting multiple usage/business models. The MPEG REL data model for a rights expression consists of four basic entities and the relationships among those entities. This basic relationship is defined by the MPEG REL assertion grant. Structurally, an MPEG REL grant consists of the following:
the principal to whom the grant is issued,
the right that the grant specifies,
the resource to which the right in the grant applies,
the condition that must be met before the right can be exercised.
Figure 13 visualizes the MPEG-21 REL data model. A principal encapsulates the identification of the principals to whom rights are granted. Each principal identifies exactly one party. In contrast, a set of principals, such as the universe of everyone, is not a principal. A principal denotes the party that it identifies by information unique to that individual. This information has some associated authentication mechanism by which the principal can prove its identity.

Figure 13 - The MPEG-21 REL.

A right is the verb that a principal can be granted to exercise against some resource under some condition. Typically, a right specifies an action (or activity), or a class of actions, that a principal may perform on or using the

associated resource. MPEG REL provides a right element to encapsulate information about rights and provides a set of commonly used, specific rights, notably rights relating to other rights, such as issue, revoke and obtain. A resource is the object over which a principal can be granted a right. A resource can be a digital work (such as an e-book, an audio or video file, or an image), a service (such as an e-mail service or a B2B transaction service), or even a piece of information that can be owned by a principal (such as a name or an address). A condition specifies the terms, conditions and obligations under which rights can be exercised. A simple condition is a time interval within which a right can be exercised.

Waiver vocabulary

The Waiver vocabulary [32] defines properties to use when describing waivers of rights over data and content. A waiver is the voluntary relinquishment or surrender of some known right or privilege. This vocabulary is designed for use with the Open Data Commons Public Domain Dedication and License and with the Creative Commons CC-0 waiver 38. The properties defined in this vocabulary are the following:
declaration: a property representing a human-readable statement describing the waiver in the context of the resource and the agent waiving their rights. Best practice is to include the name of the resource in which rights are being waived, and the name of the waiver. For example, To the extent possible under law, {{name or organization}} has waived all copyright and related or neighbouring rights to {{name of dataset}}.
norms: a property representing the community norms for access and use of a resource. Norms are not legally binding, but represent the general

principles or code of conduct adopted by a community for the access and use of resources. Best practice is to use the URI of a document describing these norms as the value of this property.
waiver: a property representing the waiver of rights over a resource. Best practice is to use the URI of a waiver legal document as the value of this property, e.g., the URI of the PDDL or of the CC-0 waiver.

Related Vocabularies

Two vocabularies which can also be used to define the licensing terms of data on the Web are the Description of a Project vocabulary (DOAP) 39 and the Ontology Metadata Vocabulary (OMV) 40 [55]. The former is an RDF/XML vocabulary to describe software projects, in particular open-source ones. It defines a property doap:license for defining the licensing terms of the project. The latter, instead, describes a particular representation of an ontology, and it captures the key aspects of the ontology metadata information, e.g., provenance, availability, statistics. OMV defines the property omv:hasLicense, which provides the underlying license model. Moreover, OMV introduces a class omv:LicenseModel, which describes the usage conditions of an ontology. Finally, we also mention the Dublin Core license document class dct:LicenseDocument 41, which represents the legal document giving official permission to do something with the resource, and the license property dct:license.
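The three Waiver properties above can be illustrated by annotating a dataset. In the sketch below, RDF triples are modeled as plain (subject, property, value) tuples; the dataset URI, the norms URI and the organization name are made up for the example, and the property names are shortened with a hypothetical wv: prefix.

```python
# Illustrative only: annotating a hypothetical dataset with the three
# Waiver vocabulary properties described above.
dataset = "http://example.org/dataset/mydata"

triples = [
    (dataset, "wv:waiver", "PDDL"),  # which waiver of rights applies
    (dataset, "wv:norms", "http://example.org/community-norms"),  # non-binding norms
    (dataset, "wv:declaration",
     "To the extent possible under law, Example Org has waived all "
     "copyright and related or neighbouring rights to MyData."),
]

# A consumer can then look up the waiver before reusing the data.
waivers = [v for (s, p, v) in triples if p == "wv:waiver"]
print(waivers)  # ['PDDL']
```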

Recent Developments in Access Control for the Web and Linked Data

In this section, we summarize some recent approaches to the issue of access control for the Web, and in particular for the Web of Data.

Description Logic for Access Control

The OASIS standard XACML is an XML-based language that is used to specify policies on web resources. XACML enables the use of arbitrary attributes in policies, allows for expressing negative authorizations and conflict resolution algorithms, and enables the use of hierarchical Role-Based Access Control, among other things. With policy languages such as XACML, a new issue has emerged: users have difficulty understanding the overall effect and consequences of their security policies. The most important feature in access control is checking that the access control policy will not result in the leakage of permissions to an unintended or unauthorized principal, i.e., safety. It has become difficult, if not impossible, to do this manually. For example, an incomplete security policy might unintentionally give access to an intruder. How can a security administrator be certain that her policy covers all corner cases? An attempt to solve these issues is presented by Kolovski et al. [36], who provide a formalization of XACML that explores the middle ground between propositional logic analysis tools (such as Margrave) and full first-order logic XACML analysis tools (like Alloy). As a basis for the XACML formalization, they use description logics, a family of languages that are decidable subsets of first-order logic and are the basis for the Web Ontology Language (OWL).
Because of the correspondence of policy analysis services to DL reasoning services (e.g., policy inclusion can be reduced to concept subsumption, whereas change impact analysis and verification can be reduced to concept satisfiability), the framework provides a variety of policy analysis services and leverages the availability of off-the-shelf DL reasoners optimized for these services.
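The intuition behind reducing policy inclusion to subsumption can be shown in a very simplified form: represent each policy as the set of (principal, action, resource) requests it permits; one policy is then included in another exactly when its permitted set is contained in the other's. The toy sketch below brute-forces this over a tiny, made-up request space, whereas real XACML/DL analysis reasons symbolically with none of these simplifications.

```python
from itertools import product

# Enumerate a tiny request space (real analysis treats it symbolically).
principals = ["alice", "bob"]
actions = ["read", "write"]
resources = ["doc1", "doc2"]

def permitted(policy):
    """The set of requests a policy permits over the toy request space."""
    return {r for r in product(principals, actions, resources) if policy(r)}

# Two hypothetical policies expressed as predicates over requests.
narrow = lambda r: r[0] == "alice" and r[1] == "read"
broad = lambda r: r[1] == "read"

# Policy inclusion = subset of permitted requests (cf. concept subsumption).
print(permitted(narrow) <= permitted(broad))  # True: narrow is included in broad
print(permitted(broad) <= permitted(narrow))  # False
```

The safety question from the text maps onto the same idea: checking that no request involving an intruder lands in the permitted set.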

Another model for access control is RelBAC (Relation Based Access Control) [9], a model, and a logic, whose overall goal is to deal with the problem of access control in Web 2.0 applications. The first feature of RelBAC is that its access control models can be designed using entity-relationship diagrams. Using RelBAC, social networks and object organizations can be modelled as lightweight ontologies, by exploiting the translation from classifications and Web directories to lightweight ontologies. This makes it possible to model permissions as Description Logic (DL) roles, access control rules as DL formulas, and policies as sets of DL formulas, and, therefore, to reason about access control simply by using DL reasoners.

Context-dependent Access Control

Abel et al. [2] present an architecture that integrates access control mechanisms based on Semantic Web policies with different kinds of RDF metadata stores. Given an RDF query, the framework partially evaluates all applicable policies and constrains the query according to the result of this evaluation. The modified query is then sent to the RDF store, which executes it like a usual RDF query. The framework enforces fine-grained access control at the triple level, i.e., all triples returned as a response to the query can be disclosed to the requester according to the policies in force. Using policies to restrict access to RDF statements requires being able to specify graph patterns (path expressions and boolean expressions), as one can do in an RDF query. In addition, it is desirable to be able to check contextual properties, such as those of the requester (possibly to be certified by credentials) or the time (in case access is allowed only in a certain period of time).

Figure 14 - The architecture of Abel et al. [2].

The architecture is visualized in Figure 14. It is composed of three main modules: Query Extension, Policy Engine and RDF Repository. The main task of the query extension is to rewrite a given query in such a way that only allowed RDF statements are accessed and returned. It is in charge of querying the policy engine for each FROM clause of the original query, and of expanding it with the extra path expressions and constraints. The implementation provides query extension capabilities for SeRQL, a query language specific to Sesame 43. The policy engine is responsible for the policy evaluation. Input information (the query context), such as the requester or disclosed credentials, may be used as well. In the implementation, the Protune policy language and its framework are used. Finally, the extended query can be passed to the underlying RDF repository. Any store supporting SeRQL, such as Sesame, can be used. The result set returned contains only allowed statements and can be directly returned to the requester.
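A drastically simplified version of this triple-level enforcement can be sketched as a post-filter: policies are modeled as triple patterns (with None as a wildcard), and only the triples matched by at least one allowed pattern are returned. This is only an illustration of the idea with invented data; the actual framework rewrites the query before execution rather than filtering results, and also takes credentials and time into account.

```python
def matches(pattern, triple):
    """A pattern component of None acts as a wildcard."""
    return all(p is None or p == t for p, t in zip(pattern, triple))

def enforce(allowed_patterns, triples):
    """Keep only the triples disclosed by at least one allowed pattern."""
    return [t for t in triples
            if any(matches(p, t) for p in allowed_patterns)]

# Hypothetical store and policy: the requester may only see foaf:name triples.
store = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:bob", "foaf:name", "Bob"),
]
policy = [(None, "foaf:name", None)]

print(enforce(policy, store))
# [('ex:alice', 'foaf:name', 'Alice'), ('ex:bob', 'foaf:name', 'Bob')]
```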

The AMO Ontology

The AMO ontology [42] consists of (i) a set of classes and properties dedicated to the annotation of the resources whose access should be controlled, and (ii) a base of inference rules modelling the access management strategy to carry out. When applied to the annotations of the resources whose access should be controlled, these rules make it possible to manage access to them according to a given strategy. An overview of the ontology is visualized in Figure 15.

Figure 15 - The AMO ontology.

Access Policies for Linked Data

Muhleisen et al. [7] present a Policy-enabled Linked Data Server (PeLDS). The main aim of this system is to provide a semantic storage system allowing its users to specify which elements of their RDF graphs are published to which user. The authors create what they call a temporary view on the stored graphs that contains only those elements the querying user has been authorized to retrieve by the publishing user. The entire stored data is partitioned into datasets using named graphs [11]. The authors use a

general-purpose rule language to express the access policies. Three main concepts are defined: Action for query-related metadata, Rule to model single rules as part of access policies, and TriplePattern for defined data classifications. The concepts UpdateAction and QueryAction are derived concepts to model different interaction types. Each action holds a user identifier and a one-to-many relationship to the rules defined by the access policy. Each rule contains a reference to its data classifications within the TriplePattern instances.

Figure 16 - The PeLDS system.

An RDF graph containing the instances of the described schema is created for every request, and merged with the affected dataset according to the defined access policy. Each rule from the access policy is attributed with the consequence of adding the rule identifier to a global list of matched rules. If such a rule matches, due to sufficient access rights for the current user, it will be added to this list. The requested dataset is loaded into memory, and the access policy is translated into instances of the policy schema and added to the dataset. The list of rules is evaluated against the information present in the dataset, together with the user identity given in the Action instance. If a rule matches, every triple matching the data classifications contained in the rule's consequence predicate list is copied from the dataset to the result graph. The user's query is executed on the result graph, and the query results are

sent back to the user. This process is depicted in Figure 16. No further details are provided by the authors to specify the proposed approach. Sacco and Passant [8] propose the Privacy Preference Ontology (PPO) 44, a lightweight vocabulary on top of the Web Access Control ontology [47] aiming at providing users with means to define fine-grained privacy preferences for restricting, or granting, access to specific RDF data. The vocabulary provides the ability to restrict access to: (i) a particular statement; (ii) a group of statements (i.e., an RDF graph); or (iii) a resource, either as the subject or the object of a particular statement. It relies on the Web Access Control vocabulary 45 to describe the access privilege to the data: either Read, Write or both. A privacy preference contains properties defining: (1) which resource, statement or graph to restrict access to; (2) the type of restriction; (3) the access control type; and (4) a SPARQL query containing a graph pattern representing what must be satisfied by the user requesting the information.

Figure 17 - The Privacy Preference Manager [8].

One way to use this ontology, as claimed by the authors [8], is to define a personal Privacy Preference Manager (PPM), providing users with means to specify preferences based on their FOAF profile. Figure 17 illustrates the related concept: (1) a requester authenticates to the other user's PPM using the WebID protocol; (2) the privacy preferences are queried to identify which preference applies; (3) the preferences are matched against the requester's profile to test what the requester can access; (4) the requested information (in this case, FOAF data) is retrieved based on what can be accessed; and (5) the requester is provided with the data she/he can access. The privacy manager is expected not to be limited to data described in FOAF, but to apply to any RDF data, since PPO is ontology-agnostic.
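The matching step (3) can be caricatured in a few lines of Python: each privacy preference pairs a condition on the requester's profile with the statements it unlocks, and the accessible data is the union over the preferences whose condition the requester satisfies. The profile attributes and data below are invented for illustration; a real PPM would instead evaluate the SPARQL graph pattern of each preference against the requester's FOAF/WebID profile.

```python
# A preference: (condition on the requester's profile, triples it grants).
preferences = [
    (lambda profile: profile.get("knows") == "ex:me",
     [("ex:me", "foaf:phone", "+33-000")]),
    (lambda profile: True,  # public preference: no condition
     [("ex:me", "foaf:name", "Serena")]),
]

def accessible(profile):
    """Union of the triples granted by every preference the profile satisfies."""
    granted = []
    for condition, triples in preferences:
        if condition(profile):
            granted.extend(triples)
    return granted

friend = {"knows": "ex:me"}
stranger = {}
print(len(accessible(friend)))    # 2: phone and name
print(len(accessible(stranger)))  # 1: name only
```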


Discussion and Perspective

In this report, we have discussed two relevant aspects of security in the Semantic Web: data provenance and data licensing. The former provides instruments to define, and then verify, the identity of the information provider. Identifying the provenance of the information in the Web of Data is particularly relevant in order to assess how trustworthy the source is considered to be. The latter, instead, provides instruments to define the terms under which the data is released, and the terms of its reuse. We have presented several approaches that have pros and cons depending on the application domain. A summary of these approaches and their main features is presented in the following table, which maps each provenance dimension to the corresponding terms in the vocabularies of Section 2:

agent: dct:Agent (Dublin Core); swp:authority (SWP); opm:Agent (OPM); foaf:Agent (FOAF); provenir:agent (Provenir); prv:Actor (Provenance)
process: dcmitype:Event (Dublin Core); swp:signatureMethod (SWP); opm:Process (OPM); provenir:process (Provenir); prv:Execution, DataAccess, DataCreation (Provenance)
artifact: dct:Collection, BibliographicResource, PhysicalResource (Dublin Core); rdfg:Graph (SWP); opm:Artifact (OPM); foaf:Document (FOAF); swancit:Citation (SWAN); provenir:data (Provenir); prv:Artifact, DataItem, File (Provenance); dcat:Catalog (Data Catalog); rdf:Statement (Changeset)
account: dct:ProvenanceStatement (Dublin Core); swp:Warrant (SWP); opm:Account (OPM); foaf:Account (FOAF); provenir:data (Provenir)
derivedFrom: dct:replaces, source, references (Dublin Core); opm:wasDerivedFrom (OPM); provenir:derives_from (Provenir); prv:precededBy (Provenance)
used: dcterms:requires (Dublin Core); swp:quotedBy (SWP); opm:used (OPM); provenir:has_participant (Provenir); prv:employedArtifact, usedData (Provenance); cs:statement, removal, addition (Changeset)
generatedBy: dct:source (Dublin Core); swp:assertedBy (SWP); opm:wasGeneratedBy (OPM); foaf:maker (FOAF); swancit:contributionAuthor (SWAN); provenir:has_participant (Provenir); prv:createdBy, yieldedBy, retrievedBy (Provenance); dct:publisher (Data Catalog)
controlledBy: dct:contributor (Dublin Core); opm:wasControlledBy (OPM); foaf:maker (FOAF); swancit:contributionAuthor (SWAN); provenir:has_agent (Provenir); prv:performedBy, involvedActor (Provenance); cs:creatorName (Changeset)
triggeredBy: dct:source (Dublin Core); opm:wasTriggeredBy (OPM); foaf:maker (FOAF); provenir:preceded_by (Provenir)
temporalValue: swp:validFrom, validUntil (SWP); opm:wasPerformedAt (OPM); provenir:has_temporal_value (Provenir); dct:modified (Data Catalog)
location: provenir:located_in (Provenir)

The table presents a mapping among the concepts used in the vocabularies presented in Section 2 concerning agents, processes, artifacts, and a selection of other concepts. The table makes use of the dimensions identified as crucial by the Provenance Working Group. Among others, we can find the following dimensions: object, attribution, process, versioning, justification, entailment, and trust. In particular, the concept of trust is not precisely defined in any of the vocabularies we consider in Section 2. Trust is a concept of high complexity, which needs to be captured in the context of vocabularies for defining the provenance of information. Distinguishing the different dimensions of trust, i.e., those concerning different domains, would make it possible to provide fine-grained descriptions, and thus evaluations, of data provenance. The use of provenance for the justification of, for instance, policies is also not addressed directly by any of the above vocabularies. Most of the vocabularies deal with the artifact, which can be of different types depending on the ontology, e.g., foaf:Document, provenir:data, rdf:Statement, and with the generatedBy and controlledBy properties, where different kinds of actors are involved in the generation and control of the data, e.g., dct:contributor, cs:creatorName. Concepts related to time and place are not much exploited in these vocabularies, with the exception of the Provenir ontology, where place (provenir:located_in) and time (provenir:has_temporal_value) are defined. The OPM, instead, considers time using the property opm:wasPerformedAt, but it does not consider the dimension of versioning, nor the concept of digital signature, which is instead considered by the SWP vocabulary. Concerning data licensing, in the table below we provide a mapping among the concepts used in the various languages presented in Section 2.
ccREL: conditions of release cc:Reproduction, cc:Distribution, cc:DerivativeWorks, cc:CommercialUse; rights cc:permits, cc:prohibits; law cc:legalcode, cc:jurisdiction
MPEG-21 REL: conditions of release Terms, Conditions, Obligations; rights Issue, Obtain, Revoke
Waiver: Declaration, Norms, Waivers
OMV: omv:LicenseModel
DOAP: doap:license
Dublin Core: dct:LicenseDocument

However, these languages represent only part of the whole set of licenses used on the Web. For instance, French national institutes like the Institut géographique national (IGN) 46 and the Institut national de la statistique et des études économiques (INSEE) 47 define their own terms of reuse. The IGN proposes various licenses 48 depending on the kind of data consumer who wants to use its data; e.g., it grants research centres and educational institutions free access to IGN data, but this does not include the use of the data as part of service activities (including continuing education), publishing activities producing educational materials distributed through traditional sales networks, or the distribution of educational content on open and publicly accessible web sites. The publications and data provided by INSEE are available for free download, unless otherwise specified, and they can be reused, including for commercial purposes, without license or royalty payments other than those collected by the copyright collecting societies governed by the Code of Intellectual Property 49 [56]. All these different examples of data licensing show that there is a lack of a uniform approach on the Web. This state of the art shows the need to develop best practices for the use of the presented licensing expression languages. Figure 18 visualizes the results of a search on Watson 50 for the licensing terms we have presented in Section 2. The results show that the Creative Commons Attribution license is the most used among the Creative Commons licenses, followed by the ShareAlike, DerivativeWorks, and CommercialUse ones. The other widely diffused way to express licenses uses the Dublin Core license property. These results make the lack of a uniform approach to data licensing even clearer.

Figure 18 - Search results with the Watson engine.

A first point to underline is that providing tools allowing data providers to define their terms of reuse is crucial for the development of the Web of Data. This is true both for institutions that want to make their data available, e.g., geographical or government information, and for individual data providers who want to protect their personal data from undesired accesses and reuses. A second point to be discussed is the difference between a licence and a waiver. As defined by Heath and Bizer [50], licenses and waivers represent two sides of the same coin: licenses grant others rights to reuse something, and generally attach conditions to this reuse, while waivers enable owners to explicitly waive their rights to something, such as a dataset. For instance, the legal basis for Creative Commons licenses is copyright, which is applicable to creative works. This means that CC licenses are inapplicable to factual data, e.g., geo-coordinates. In this case, the data provider can (and should) define its own terms of reuse by means of the Waiver vocabulary. A third issue is that the licences express the terms under which the data is released and how it can be reused; there is no evidence supporting the fact that the data is then effectively reused according to the terms specified by the licences. Given that we are talking about Linked Open Data, it has to be discussed whether solutions such as honeypots would be desirable in this kind of application in order to monitor the actions that users perform with the data.

More information

Deliverable Final Data Management Plan

Deliverable Final Data Management Plan EU H2020 Research and Innovation Project HOBBIT Holistic Benchmarking of Big Linked Data Project Number: 688227 Start Date of Project: 01/12/2015 Duration: 36 months Deliverable 8.5.3 Final Data Management

More information

Rethinking Semantic Interoperability through Collaboration

Rethinking Semantic Interoperability through Collaboration E-GOV 2011 10 th Conference on Electronic Government Delft, August 28 September 2, 2011 Rue Froissart 36, Brussels - 1040, Belgium Rethinking Semantic Interoperability through Collaboration Stijn Goedertier

More information

Library of Congress BIBFRAME Pilot. NOTSL Fall Meeting October 30, 2015

Library of Congress BIBFRAME Pilot. NOTSL Fall Meeting October 30, 2015 Library of Congress BIBFRAME Pilot NOTSL Fall Meeting October 30, 2015 THE BIBFRAME EDITOR AND THE LC PILOT The Semantic Web and Linked Data : a Recap of the Key Concepts Learning Objectives Describe the

More information