An Archiving System for Managing Evolution in the Data Web

Marios Meimaris*, George Papastefanatos and Christos Pateritsas*
Institute for the Management of Information Systems, Research Center "Athena", Greece
{m.meimaris, gpapas, pater}@imis.athena-innovation.gr

Abstract. The rising Data Web has brought forth the requirement to treat information as dynamically evolving aggregations of data from remote and heterogeneous sources, creating the need for intelligent management of the change-driven aspects of the underlying evolving entities. Datasets change at multiple levels: their semantics evolve, as do structural characteristics such as their model and format. In this paper, we present an archiving system that is driven by the need to treat evolution in a unified way, taking into account change management, provenance, temporality and querying, and we implement the archive on top of a data model and query language designed specifically for addressing preservation and evolution management in the Data Web.

Keywords: Evolution, Change Management, Linked Data Dynamics

1 Introduction

The rising Data Web treats information as more than a collection of linked documents: it is a dynamic aggregation of data collections from remote and heterogeneous sources. The Linked Data paradigm [2] advocates appropriate standards and recommendations in order to promote best practices in the creation, publication and retrieval of data, with the use of standards such as RDF¹ and SPARQL². The dynamic nature of web communities and the heterogeneity that comes with the data attribute temporality to the Data Web; this can introduce changes in unpredictable ways, which can in turn lead to information and quality loss, as well as trust issues [7]. In this regard, the benefits of managing evolution fall into two categories: (i) quality/trust and (ii) maintenance/preservation. In the former lie issues such as broken links or URI changes, while in the latter, data exploitation over historic data provides new insights, and therefore added value, on dataset dynamics and their related actors.

Supported by the European Union (European Social Fund - ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF) - Research Funding Program: Thales. Investing in knowledge society through the European Social Fund.
* Supported by the EU-funded ICT project "DIACHRON" (agreement no 601043).
¹ http://www.w3.org/rdf/
² http://www.w3.org/tr/rdf-sparql-query/

However, evolution management is a multidimensional problem that involves versioning, temporal representation, change representation and management, provenance and temporal querying [1,9]. Our contribution is an archiving system that tackles these issues in a unified manner, by employing a data model and query language specifically designed to address the diverse characteristics of evolving Linked Data. The result is a scalable archiving system that enables multi-versioning, metadata annotations at different levels of granularity, evolution management from single records and entities up to whole datasets, and a common representation layer that abstracts from both source model and time in order to represent time-agnostic entities (datasets, Linked Data resources and so on) originating from different source models (e.g. ontological, relational).

Related Work. In [10,11] the authors extend HTTP with a temporal dimension for accessing past versions of web documents and Linked Data resources. In [12] the authors address multi-versioning for XML documents by using deltas between sequential versions. The authors of [3] propose XML archiving with top-down temporality. Semantic and structural differences between versions of RDFS graphs are addressed in [13].

2 Preliminaries

We use the data model presented in [4,5] as the basis for the archive system. The data model represents datasets from different source models and formats, namely relational, ontological and multidimensional, by offering a common representation level that treats information in a dual manner: information-rich entities are represented along with their semantics, and the structural aspects of the data (the break-down of datasets into schema elements, records and record attributes) are represented as well, in order to enable evolution management and metadata annotations at different levels of granularity. Changes are classified as simple or complex and are stored alongside their datasets. All types of entities (datasets, changes and so on) are represented in the same domain model, which enables them to be directly correlated and combined in queries that span datasets and time. Time-agnostic representations of entities are differentiated from time-aware entities (i.e. particular versions).

Based on this model, a query language [6] has been designed and implemented for intuitive querying of evolving datasets. It enables combining semantics, structural characteristics and changes in the same queries across time and datasets. A query can be comprised of sub-queries that define particular datasets, versions, changes and temporal ranges, as well as variable bindings for particular changes, records and/or attributes. The query language has been implemented as an extension of SPARQL.
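The concrete syntax of the query language is defined in [6] and is not reproduced here. As a rough, non-normative illustration of the kind of cross-version query the archive is meant to answer, the sketch below issues plain SPARQL 1.1 over one named graph per dataset version through Jena ARQ. The ex: namespace, the graph URIs and the endpoint address are hypothetical placeholders for this example, not the archive's actual vocabulary or deployment.

```java
import org.apache.jena.query.*;

/**
 * Illustrative sketch only: queries two hypothetical version graphs of the
 * same diachronic dataset and reports subjects present in v2 but not in v1,
 * i.e. a naive "added records" check. The archive itself expresses such
 * queries through the extended query language of [6] and its change entities.
 */
public class VersionDiffSketch {
    public static void main(String[] args) {
        String sparql =
            "PREFIX ex: <http://example.org/archive/>          \n" +
            "SELECT ?subject WHERE {                            \n" +
            "  GRAPH ex:datasetA-v2 { ?subject ?p ?o }          \n" +
            "  FILTER NOT EXISTS {                              \n" +
            "    GRAPH ex:datasetA-v1 { ?subject ?p2 ?o2 }      \n" +
            "  }                                                \n" +
            "}";

        Query query = QueryFactory.create(sparql);  // parsed with Jena ARQ
        // Placeholder SPARQL endpoint exposed by the archive's store.
        try (QueryExecution qe =
                 QueryExecutionFactory.sparqlService("http://localhost:8890/sparql", query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().getResource("subject"));
            }
        }
    }
}
```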

3 Architecture and Implementation

The architecture and the main components of the archiving system are depicted in Figure 1. The main storage component is implemented on top of a Virtuoso 7.1³ instance. The archive's web service interface is exposed over HTTP as the primary access point, through a RESTful API; this enables programmatic access and integration with more complex architectures. A web-based GUI additionally offers browser access for uploading and querying the archive's data. The main components are described in the following.

Figure 1. Architecture of the archive (Web UI, Admin UI, HTTP API, Admin API and Data API on top of the Query Processor, Modeler, Archive Administration and Access Manager components, which access the underlying quad store through the Store Connector).

Query Processor. This component provides the base mechanism for data access, query planning and execution. The Query Parser and Validator module validates query syntax as defined in [6]; the parser parses the query string and maps it to corresponding Java objects. The Query Translator creates execution plans for queries by translating them to the native query language of the persistence store, i.e. SPARQL. The created plans are based on the archive structures implemented in the persistence store and on the appropriate indexes and dictionaries. Finally, the Query Executor module executes the created plan and builds the result set of the query.

Modeler. This component handles the creation and conversion of datasets between the diachronic data model defined in [4,5] and their source data models, formats and serializations. The Translator module handles dataset conversion to the archive's internal storage structures, and is also responsible for named graph management, URI minting and dictionary administration. The Loader module serves as the input interface for dataset insertion, and supports both bulk loading and single insertions of datasets; furthermore, it links new dataset versions to their corresponding diachronic datasets. Finally, the Archive Optimizer module is designed to enable optimization of the storage method based on pre-defined archiving strategies [8].

³ https://github.com/openlink/virtuoso-opensource
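For illustration, the following sketch mimics what the Modeler's Translator and Loader modules do conceptually: it stores a new dataset version in its own named graph and links it to its time-agnostic diachronic dataset. The ex: vocabulary (e.g. hasVersion) and the in-memory Jena dataset are assumptions made for the example; the archive itself uses its internal storage structures on top of Virtuoso.

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DCTerms;

/**
 * Illustrative sketch only: stores a new dataset version in its own named
 * graph and links it to a time-agnostic "diachronic dataset" resource.
 * The ex: vocabulary is hypothetical; an in-memory Jena dataset stands in
 * for the archive's Virtuoso-backed store.
 */
public class VersionRegistrationSketch {
    static final String EX = "http://example.org/archive/";

    public static void main(String[] args) {
        Dataset store = DatasetFactory.createTxnMem();
        store.begin(ReadWrite.WRITE);
        try {
            String versionGraph = EX + "datasetA-v2";

            // The version's payload (the converted source data) goes into its own named graph.
            Model payload = ModelFactory.createDefaultModel();
            payload.createResource(EX + "record/42")
                   .addProperty(DCTerms.title, "an updated record");
            store.addNamedModel(versionGraph, payload);

            // Version-level metadata links the new version to its diachronic dataset.
            Model meta = store.getDefaultModel();
            Resource diachronic = meta.createResource(EX + "datasetA");
            Resource version = meta.createResource(versionGraph);
            diachronic.addProperty(meta.createProperty(EX, "hasVersion"), version);
            version.addProperty(DCTerms.created, "2015-04-01");

            store.commit();
        } finally {
            store.end();
        }
    }
}
```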

Access Manager. This component provides low-level data management functionality for the archive and decouples the archive's functionality from its underlying store. Its sub-component, the Transaction Manager module, provides low-level transaction handling (commit, rollback) and also manages the store connection. The Cache module acts as temporary in-memory storage for archive structures that are frequently accessed. The Batch Loader offers alternative ways of loading data into the store efficiently. Finally, the Statistics Manager module collects and analyzes dataset and query workload information in order to produce statistics about them; this information can be used for query optimization by the Query Translator.

Archive Administration. This component is a set of management and configuration tools for the archive, providing functionality such as manual configuration of archiving strategies, as well as user management. Moreover, this component can act as an abstraction layer for common management tasks (e.g. starting and stopping the data store service) in order to decouple the Admin API and UI from the store technology and/or vendor.

Store Connector. The Store Connector is the software package that provides an API to the other components of the archiving module for communication and data exchange with the underlying store. It is usually provided by the store's vendor and is specific to a programming language, for example the Virtuoso JDBC Driver package⁴.

Store. This is the persistence mechanism of the archive. It employs a generic native RDF store with named graph support, where any store implementation is pluggable.

⁴ http://docs.openlinksw.com/virtuoso/virtuosodriverjdbc.html
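To indicate what the Store Connector layer wraps when Virtuoso is the underlying store, the sketch below opens a JDBC connection and applies a single update inside an explicit transaction, the kind of commit/rollback handling otherwise delegated to the Transaction Manager. Host, port, credentials and the target graph are placeholders, and the Virtuoso-specific details (driver class, SPARQL statement issued over the SQL channel) should be read as assumptions based on the vendor's documentation rather than as the archive's actual configuration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

/**
 * Illustrative sketch only: a minimal Store Connector-style interaction with
 * a Virtuoso quad store over JDBC, with explicit commit/rollback handling.
 * Connection details and the target graph are hypothetical placeholders.
 */
public class StoreConnectorSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("virtuoso.jdbc4.Driver");  // Virtuoso JDBC driver (assumed jdbc4 variant)
        String url = "jdbc:virtuoso://localhost:1111";

        try (Connection conn = DriverManager.getConnection(url, "dba", "dba")) {
            conn.setAutoCommit(false);  // transactions are handled explicitly
            try (Statement stmt = conn.createStatement()) {
                // Virtuoso accepts SPARQL statements through its SQL channel via the SPARQL keyword.
                stmt.execute(
                    "SPARQL INSERT DATA { GRAPH <http://example.org/archive/datasetA-v2> " +
                    "{ <http://example.org/record/42> <http://purl.org/dc/terms/title> \"an updated record\" } }");
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();  // undo the partial update on failure
                throw e;
            }
        }
    }
}
```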

4 Implementation and Evaluation

We have implemented all components in Java; we have used the Spring⁵ framework for developing the HTTP API services and the Data API, and we have extended the SPARQL 1.1 grammar parser and query planner based on Jena ARQ⁶. We evaluated the archive system in two different data contexts, drawn from the life sciences (ontological) and government statistics (multidimensional) domains. The archive's performance was assessed along three main dimensions: (i) loading times, (ii) retrieval and conversion times (de-reification and serialization), and (iii) query execution times [6]. Our tests indicate that, as far as query and load times are concerned, the system scales linearly with respect to the input dataset size.

5 Conclusions

The increasing volume and dynamicity of datasets in the Data Web have created new issues related to the decentralized nature and evolution of Linked Data, and with them requirements for efficient solutions to these problems. In this paper, we have presented an archiving system that tackles evolution management of datasets in the Data Web. The archiving system is based on a data model and query language designed to directly and natively address the issues of data preservation, change detection, change representation, provenance and metadata annotations at different granularity levels, as well as source heterogeneity. We have implemented and tested the archive in the life sciences and government statistics domains, and we have found it to scale linearly with input size as far as querying and loading are concerned. As future work, we intend to fully incorporate the dynamic archiving strategies described in [8], and to address longitudinal querying and data retrieval efficiency in settings of hybrid dataset storage, where versions are stored both as deltas and as full materializations.

⁵ http://projects.spring.io/spring-framework/
⁶ http://jena.apache.org/documentation/query/

6 References

1. Auer, S. et al.: Diachronic linked data: towards long-term preservation of structured interrelated information. In: Proceedings of the First International Workshop on Open Data, pp. 31-39. ACM (2012)
2. Bizer, C. et al.: Linked Data - The Story So Far. Special Issue on Linked Data, International Journal on Semantic Web and Information Systems (2009)
3. Buneman, P. et al.: Archiving scientific data. ACM Transactions on Database Systems (TODS) 29(1) (2004)
4. Meimaris, M. et al.: Towards a Framework for Managing Evolving Information Resources on the Data Web. In: Proceedings of the PROFILES Workshop of ESWC (2014)
5. Meimaris, M. et al.: A Framework for Managing Evolving Information Resources on the Data Web. arXiv preprint arXiv:1504.06451 (2015)
6. Meimaris, M. et al.: A Query Language for Multi-version Data Web Archives. arXiv preprint arXiv:1504.01891 (2015)
7. Papastefanatos, G., Stavrakas, Y.: Diachronic Linked Data: Capturing the Evolution of Structured Interrelated Information on the Web. ERCIM News 2014(96) (2014)
8. Stefanidis, K. et al.: On designing archiving policies for evolving RDF datasets on the Web. In: Conceptual Modeling, pp. 43-56. Springer International Publishing (2014)
9. Umbrich, J. et al.: Dataset dynamics compendium: A comparative study (2010)
10. Van de Sompel, H. et al.: An HTTP-based versioning mechanism for linked data. arXiv preprint arXiv:1003.3661 (2010)
11. Van de Sompel, H. et al.: Memento: Time travel for the web. arXiv preprint arXiv:0911.1112 (2009)
12. Wong, R., Lam, N.: Managing and querying multi-version XML data with update logging. In: Proceedings of the 2002 ACM Symposium on Document Engineering, pp. 74-81. ACM (2002)
13. Völkel, M., Groza, T.: SemVersion: An RDF-based ontology versioning system. In: Proceedings of the IADIS International Conference WWW/Internet, vol. 2006, p. 44 (2006)