TELEIOS FP7-257662 Deliverable D4.3. The evaluation of the developed implementation


Kostis Kyzirakos, George Garbis, Charalampos Nikolaou, Konstantina Bereta, Stella Giannakopoulou, Kallirroi Dogani, Panayiotis Smeros, Manolis Koubarakis, Manos Karpathiotakis, Eva Klien, Konstantinos Kravaritis, Stavros Vassos, Ioannis Smaragdakis, Dimitris Bakogiannis, Ioannis Foufoulas and Consortium Members

October 17, 2013

Status: Final
Scheduled Delivery Date: 31 August, 2013

Executive Summary

In this deliverable we present the benchmarking and evaluation of the components that have been developed in WP4 and reported in the deliverables D4.1 and D4.2. This document also reports on additional components that have been developed and are related to the project, although their development was not initially included in the Description of Work. The deliverable is divided into two parts. The first part focuses on the evaluation of the developed components and the second part documents the additional tools that have been developed.

In the first part of the deliverable, we present the evaluation of the geospatial features of Strabon. We compare RDF stores in terms of the functionality that they offer and then evaluate the performance of the RDF stores that conform to GeoSPARQL. Next, we present the evaluation of the temporal features of Strabon and a comparison of Strabon with systems that offer similar temporal functionality.

In the second part of the deliverable, we document the following tools, the development of which was not included in the Description of Work and has been completed during the third year of the project. First, we present the Query Builder, a component that can be used for constructing, saving and sharing stSPARQL queries through a graphical user interface. Second, we present Sextant, a web-based tool that enables exploration of linked geospatial data as well as creation, sharing, and collaborative editing of thematic maps by combining linked geospatial data and other geospatial information available in standard OGC file formats. Last, we present the temporally enabled version of Sextant, which we have renamed SexTant (the capitalized T is used to emphasize the temporal dimension). SexTant extends the previous version with the capability of visualizing geospatial features that have a temporal dimension on both a map and a timeline. This tool exploits all the spatial and temporal features of the data model stRDF and the query language stSPARQL, which have been implemented in Strabon and reported in detail in the deliverables D4.1 and D4.2.

Because the work that has been performed in WP4 has advanced the state of the art in the respective research topics, it has led to a significant number of publications that cover almost all the topics described above. We have therefore organized the deliverable so that the publications are given verbatim, with additional discussion only when necessary.

Document Information

Contract Number: FP7-257662
Acronym: TELEIOS
Full title: TELEIOS: Virtual Observatory Infrastructure for Earth Observation Data
EU Project Officer: Francesco Barbato
Deliverable: Number D4.3, Name: The evaluation of the developed implementation
Task: Number T4.3, Name: Benchmarking and evaluation of the developed system
Work package: Number WP4
Date of delivery: Contract M36 (Aug. 2013), Actual 31 August 2013
Status: Draft / Final
Nature: Prototype / Report
Distribution Type: Public / Restricted
Responsible Partner: NKUA
QA Partner: ACS
Contact Person: Manolis Koubarakis

Project Information

This document is part of a research project funded by the IST Programme of the Commission of the European Communities as project number FP7-257662.

The Beneficiaries in this project are:

Partner: National and Kapodistrian University of Athens, Department of Informatics and Telecommunications
Acronym: NKUA
Contact: Prof. Manolis Koubarakis, National and Kapodistrian University of Athens, Department of Informatics and Telecommunications, Panepistimiopolis, Ilissia, Athens, Greece (koubarak@di.uoa.gr), Tel: +30 210 7275213, Fax: +30 210 7275214

Partner: Fraunhofer Institute for Computer Graphic Research
Acronym: Fraunhofer
Contact: Dr. Eva Klien, Fraunhofer Institute for Computer Graphic Research, Fraunhofer Strasse 5, Darmstadt, Germany (eva.klien@igd.fraunhofer.de), Tel: +49 6151 155 412, Fax: +49 6151 155 444

Partner: German Aerospace Center, The Remote Sensing Technology Institute, Photogrammetry and Image Analysis Department, Image Analysis Team
Acronym: DLR
Contact: Prof. Mihai Datcu, German Aerospace Center, The Remote Sensing Technology Institute, Oberpfaffenhofen, D-82234 Wessling, Germany (mihai.datcu@dlr.de)

Partner: Stichting Centrum voor Wiskunde en Informatica, Database Architecture Group
Acronym: CWI
Contact: Prof. Martin Kersten, Stichting Centrum voor Wiskunde en Informatica, P.O. Box 94097, NL-1090 GB Amsterdam, Netherlands (martin.kersten@cwi.nl)

Partner: National Observatory of Athens, Institute for Space Applications and Remote Sensing
Acronym: NOA
Contact: Dr. Charis Kontoes, National Observatory of Athens, Institute for Space Applications and Remote Sensing, Vas. Pavlou and I. Metaxa, Athens, Greece (kontoes@space.noa.gr)

Partner: Advanced Computer Systems A.C.S S.p.A
Acronym: ACS
Contact: Mr. Ugo Di Giammatteo, Advanced Computer Systems A.C.S S.p.A, Via Della Bufalotta 378, Rome, Italy (udig@acsys.it)

Contents

1 Introduction
2 Evaluation of the geospatial features of Strabon
  2.1 Geographica: A Benchmark for Geospatial RDF Stores
    Functional Comparison of Geospatial RDF Stores
    Performance Evaluation of Geospatial RDF Stores
  2.2 The Strabon evaluation by the GeoKnow project
3 Evaluation of the temporal features of Strabon
4 Additional Implementations
  The Visual Query Builder
    Functionality Overview
    Architecture and Design
    Query Builder UI
    Query Builder Server
    Future work
Conclusions

List of Figures

2.1 Execution times of different versions/configurations of Strabon per query family
Facet-graph-based editor (left) and text-based editor (right)
Visual Query Builder system overview
Technical framework for the Query Builder
Overall view on the functionalities of the Query Builder
Screenshot of the Query Builder UI showing a Facet-Graph based query representation for the DLR Use Case
Screenshot of the Query Builder UI showing the stSPARQL equivalent of the Facet-Graph based query representation for the DLR Use Case in the text based Query Editor

List of Tables

2.1 Functionality of Geospatial RDF Stores
Geospatial RDF stores evaluated in [ABG+13]
Spatially-enabled RDBMS evaluated in [ABG+13]
Execution of experiments using datasets and queries defined in [ABG+13] w.r.t. different configurations

1 Introduction

In this deliverable we present the benchmarking and evaluation of the components that have been developed in the context of WP4 and reported in the deliverables D4.1 [GBK+12] and D4.2 [BDG+13]. This document also reports on additional components that have been developed and are related to the project, although their development was not initially included in the Description of Work. The deliverable is divided into two parts. The first part focuses on the evaluation of the developed components and the second part documents the additional tools that have been developed.

In the first part of the deliverable we describe the evaluation that has been performed on the following components developed in the context of WP4:

Evaluation of the geospatial features of Strabon. The experimental evaluation of the geospatial features of Strabon, as reported in the deliverables D4.1 [GBK+12] and D4.2 [BDG+13] that precede this deliverable, has been documented in more detail in the research paper [KKK12b], which shows that Strabon often performs better than the other systems it was compared with. As promised in Task 4.3 of WP4, this work was later followed by the design and implementation of a benchmark, called Geographica, to be presented at the forthcoming International Semantic Web Conference 2013 [GKK13]. Geographica is composed of two workloads with their associated datasets and queries. The first (real-world) workload is based on publicly available linked datasets. It tests primitive spatial functions and the performance of RDF stores in some typical application scenarios. The second (synthetic) workload evaluates RDF stores in a controlled environment using a generator that produces synthetic datasets of various sizes and generates queries of varying thematic and spatial selectivity.

The Geographica benchmark is discussed in Chapter 2, where we provide a functional and performance comparison of RDF stores that takes into account the recent advances to the state of the art in the area of geospatial RDF stores. Section 2.2 of Chapter 2 reports on the results of an independent evaluation of Strabon that was carried out in the context of the EU project GeoKnow and presented in the deliverable D2.1.1 [ABG+13]. The results reported in [ABG+13] regarding the performance of Strabon compared to other geospatial RDF stores differ significantly from the results that we have measured and presented in publications authored by our group [KKK12b, GKK13]. For this reason, we repeat the benchmark proposed in [ABG+13] and examine and comment on the reported results regarding the performance of Strabon.

Evaluation of the temporal features of Strabon. The temporal features of the data model stRDF and the query language stSPARQL and their implementation in Strabon were reported in the deliverable D4.2 [BDG+13] and also published in the research paper [BSK13], which was presented at the Extended Semantic Web Conference. This work also includes the evaluation of these features and a comparison with software solutions that offer similar functionality, and shows that Strabon is also very efficient with respect to its temporal functionality. This paper is provided in Chapter 3.

In the second part of the deliverable we report on the following tools, the development of which was not included in the Description of Work and has been completed during the third year of the project:

The Visual Query Builder. This component enables different kinds of users to construct, and in the future save and share, stSPARQL queries through a graphical user interface.
Users of the query builder are not required to have background knowledge of stSPARQL: spatial and topological constraints can be arranged in a complex query graph without detailed knowledge of the query language. The UI options that are provided for query construction are derived from a meta-data ontology (i.e., the meta-data ontologies developed for the NOA and DLR Use Cases), which describes the structure of the data sets to which the constructed query will refer. The documentation of the Visual Query Builder is provided in Chapter 4.

Sextant. Sextant is a web-based tool that enables exploration of linked geospatial data as well as creation, sharing, and collaborative editing of thematic maps by combining linked geospatial data and other geospatial information available in standard OGC file formats. Sextant is described further in the demo paper [NDKK13], which was presented at ESWC 2013, where it received the best demo award. This publication is provided in Chapter 4.

SexTant. The temporally enabled version of Sextant, which we have renamed SexTant (the capitalized T is used to emphasize the temporal dimension), extends the older version with the ability to also visualize the temporal dimension of time-evolving geospatial data. The demo paper [KNK+13] contains a more detailed description of SexTant. This publication is also provided in Chapter 4.

2 Evaluation of the geospatial features of Strabon

2.1 Geographica: A Benchmark for Geospatial RDF Stores

Functional Comparison of Geospatial RDF Stores

In this section we present RDF stores that provide support for geospatial data and compare them in terms of the functionality that they offer. GeoSPARQL, an OGC standard for representing and querying geospatial data on the Semantic Web, has not yet been adopted by all RDF stores with geospatial functionality. Thus, we concentrate our comparison on the language that is supported for handling spatial data and on the spatial relation classes and functions that are offered. We organize the RDF stores in two categories: (i) geospatial RDF stores that conform to GeoSPARQL and (ii) generic RDF stores that provide limited geospatial functionality. A tabular view of this comparison can be found in Table 2.1.

Geospatial RDF Stores that conform to the GeoSPARQL standard

The system Strabon, under development in our group since 2009, is a storage and query evaluation module for stRDF/stSPARQL [KKK10]. Strabon extends the well-known RDF store Sesame, allowing it to manage both thematic and spatial data expressed in stRDF and stored in the PostGIS spatially enabled DBMS. The current version of Strabon (Strabon 3.0) fully implements stSPARQL. stSPARQL provides the machinery of the OpenGIS Simple Feature Access standard as well as spatial aggregation functions (which have already been adopted by spatially enabled RDBMS but not yet by most geospatial RDF stores), other useful spatial functions (e.g., directional relations) and temporal extension functions. In addition, Strabon 3.0 fully implements the GeoSPARQL Core, Geometry extension and Geometry Topology extension components. The implementation of this subset of GeoSPARQL in Strabon was straightforward given the close relationship between stSPARQL and GeoSPARQL [KKK+12a].
OpenSahara uSeekM is an add-on library for semantic repositories that use the Sesame Java interface. uSeekM initially focused on providing full-text search functionality. Newer versions have been enhanced with geospatial capabilities, and uSeekM now supports most of GeoSPARQL. Specifically, uSeekM supports the GeoSPARQL Core, Topology Vocabulary extension, Geometry extension, Geometry Topology extension and RDFS Entailment extension components. uSeekM supports only the WKT serialization for geometries and implements all three relation family classes (Simple Features, Egenhofer, RCC8). Regarding reference systems, only WGS84 is supported. Finally, uSeekM implements some extension functions that are not defined by GeoSPARQL, such as area and shortestline, which compute the area of a geometry and the shortest line between two geometries, respectively. For spatial indexing, uSeekM uses a PostGIS database to create an R-tree-over-GiST index. The optimizer checks whether query evaluation will benefit from the extra index and, if so, parts of the query are answered by PostGIS using the R-tree and the remaining parts are answered by the original RDF database.

The RDF store Parliament, developed by Rob Battle and Dave Kolas at Raytheon BBN Technologies, implements most of the functionality of GeoSPARQL, except for the Query Rewriting extension, which has been promised for a forthcoming version [BK11]. Parliament supports both the WKT and GML serializations, multiple CRS and all three topological relation family classes. Parliament uses a spatial index based on a standard R-tree implementation. Unlike Strabon and uSeekM, which detect spatial objects by the datatype used, Parliament relies on predicates. The RDF graph is scanned for triples that contain the geo:asWKT or geo:asGML predicates; for each matching triple, Parliament creates a record for the geometry represented in the object of the triple and inserts it into the index. The query optimizer tries to split SPARQL queries into multiple parts and to produce an optimized query plan across the spatial and thematic components of the query. The current version of Parliament (v2.7.4) concentrates on optimizing query patterns (using the Topology Vocabulary extension of GeoSPARQL), while it omits optimization for filter functions.

Table 2.1: Functionality of Geospatial RDF Stores

System | Language | Index | Geometries | CRS support | Comments
Strabon | stSPARQL/GeoSPARQL | R-tree-over-GiST | WKT/GML support | Yes | OGC-SFA, Egenhofer, RCC-8
uSeekM | GeoSPARQL | R-tree-over-GiST | WKT support | No | OGC-SFA, Egenhofer, RCC8
Parliament | GeoSPARQL | R-Tree | WKT/GML support | Yes | OGC-SFA, Egenhofer, RCC-8
Oracle 12c | GeoSPARQL | R-Tree | WKT | Yes | OGC-SFA
Virtuoso | SPARQL | R-Tree | 2D point geometries (in WKT) | Partial | SQL/MM (subset)
AllegroGraph | Extended SPARQL | Distribution sweeping technique | 2D point geometries | Partial | Buffer, Bounding Box, Distance
OWLIM | Extended SPARQL | Custom | 2D point geometries (W3C Basic Geo Vocabulary) | No | Point in polygon, Buffer, Distance

The well-known Oracle DBMS also offers support for representing and querying geospatial data in RDF through its Semantic Technologies product suite (version 12c, Release 1). From this version onwards, Oracle DBMS conforms to GeoSPARQL. Specifically, it supports the Core, Topology Vocabulary extension, Geometry extension, Geometry Topology extension, and RDFS Entailment extension components. Various CRS are supported, but only the WKT geometry serialization.
Oracle DBMS offers the SPARQL extension functions defined by the Simple Features relation family class. Other SPARQL extension functions, such as nearestneighbor and centroid, are offered as well. Oracle DBMS uses an R-tree to index spatial data. When creating a spatial index, the user must define the CRS to be used, the minimum and maximum value for each dimension of the data, and a tolerance value. Tolerance is a positive number indicating how close together two points must be to be considered the same point.

Generic RDF Stores with Limited Geospatial Capabilities

OpenLink Virtuoso provides support for the representation and querying of two-dimensional point geometries. Virtuoso allows geometries to be expressed either in WGS84 or in a flat two-dimensional plane. Like stSPARQL and GeoSPARQL, Virtuoso models geometries as typed literals, and a dedicated datatype has been introduced for this purpose. The value of such a literal is the WKT serialization of the spatial object's geometry. Virtuoso offers a vocabulary for a subset of the ISO SQL/MM standard to perform geospatial queries using SPARQL. The supported SQL/MM functions are realized by this vocabulary, which can be used in the SELECT and FILTER clauses of a SPARQL query. In a FILTER, for example, the user can test for relations between two geometries by using SPARQL functions corresponding to the st_intersects, st_contains, and st_within SQL/MM functions. Thus, the query language of Virtuoso makes design decisions similar to those of stSPARQL and corresponds only to a part of the functionality envisioned by GeoSPARQL. Note that these functions are extended with a third argument (called precision), which specifies a maximum distance between two points such that the points will still be considered to overlap. Thus, these functions can support buffer queries that exploit the spatial index of Virtuoso. Virtuoso uses an R-tree index implemented as a table in its relational database component.

AllegroGraph is one of the first RDF stores that provided support for geospatial data. It can only store geometries that are points in a two-dimensional space, and it uses a strip-based index. AllegroGraph exploits typical sorting algorithms and retrieval techniques for linear data to search and retrieve points in two dimensions. Specifically, it combines the X and Y coordinates into a single construct and divides the Y range into strips of known width (similar to the expected width of the search region). Geospatial data are sorted first on the Y strip, then on the X coordinate and finally on the Y coordinate. Thus, for a specific search region, a linear traversal of the X range is done only in the strips that overlap the search region, and an area search is transformed into a small number of linear searches. The drawback of this technique is that the user must know in advance how the data will be used in order to define a proper value for the strip width. Support is provided for both Cartesian and spherical coordinate systems. For querying, AllegroGraph introduces a GEO operator for expressing geospatial query patterns in SPARQL. GEO has a syntax similar to the GRAPH operator of standard SPARQL. Whenever a geospatial query is posed in AllegroGraph, the bindings of the traditional WHERE clause of the query are evaluated in a first step.
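As an aside on the strip-based indexing scheme described above, the following minimal Python sketch may help make the idea concrete. It is not AllegroGraph's implementation: the class name StripIndex, its methods and the strip_width parameter are hypothetical and chosen only for illustration. Points are grouped by Y strip and kept sorted on (X, Y) inside each strip, so a rectangular search only scans the strips that overlap the search region.

```python
import bisect
from collections import defaultdict


class StripIndex:
    """Toy strip-based index for 2D points (illustrative assumption only)."""

    def __init__(self, strip_width):
        self.strip_width = strip_width
        # strip number -> list of (x, y) tuples kept sorted on x, then y
        self.strips = defaultdict(list)

    def _strip(self, y):
        # Number of the horizontal strip that contains coordinate y.
        return int(y // self.strip_width)

    def insert(self, x, y):
        bisect.insort(self.strips[self._strip(y)], (x, y))

    def search(self, min_x, min_y, max_x, max_y):
        """Return all points inside the closed rectangle."""
        hits = []
        for s in range(self._strip(min_y), self._strip(max_y) + 1):
            row = self.strips.get(s, [])
            # Jump to the first point with x >= min_x, then scan linearly.
            start = bisect.bisect_left(row, (min_x, float("-inf")))
            for x, y in row[start:]:
                if x > max_x:
                    break
                if min_y <= y <= max_y:
                    hits.append((x, y))
        return hits
```

The sketch also shows the drawback mentioned in the text: if strip_width is much smaller than the typical search region, many strips must be visited, and if it is much larger, each linear scan touches many points outside the region.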
These bindings are then shared with the geospatial part of the query defined by the GEO clause and are used to define an area by means of either a buffer operator (with variants for spherical and Cartesian coordinate systems) or a bounding box operator. The area defined is used to filter the bindings of the GEO operator, returning the ones that lie within this area. Results can also be filtered using AllegroGraph's functions that compute the Cartesian or the Haversine distance between two points, as well as functions that return a point's coordinates so that numeric comparisons can be made between the coordinates of different points. This syntax is neither as flexible as the syntax of GeoSPARQL and stSPARQL, nor SPARQL compliant. Recently, the functionality of AllegroGraph has been enhanced with property functions for determining points in a bounding box and points within a circle.

OWLIM is another semantic repository enhanced with geospatial capabilities. It allows the representation of point geometries using the W3C Basic Geo vocabulary. OWLIM introduces a series of property functions extending SPARQL to enable the expression of queries over WGS84 point geometries. These functions are restricted to buffer operations, as in Virtuoso and AllegroGraph, and point-within-polygon operations. Additionally, a SPARQL filter function that computes the distance between two points is available.

Performance Evaluation of Geospatial RDF Stores

Our work in this area is presented in the following paper, "Geographica: A Benchmark for Geospatial RDF Stores", which will be presented at ISWC 2013.

Geographica: A Benchmark for Geospatial RDF Stores

George Garbis, Kostis Kyzirakos, and Manolis Koubarakis
National and Kapodistrian University of Athens, Greece
{ggarbis,kk,koubarak}@di.uoa.gr

Abstract. Geospatial extensions of SPARQL like GeoSPARQL and stSPARQL have recently been defined and corresponding geospatial RDF stores have been implemented. However, there is no widely used benchmark for evaluating geospatial RDF stores which takes into account recent advances to the state of the art in this area. In this paper, we develop a benchmark, called Geographica, which uses both real-world and synthetic data to test the offered functionality and the performance of some prominent geospatial RDF stores.

Keywords: benchmarking, geospatial, RDF store, Linked Open Data, GeoSPARQL, stSPARQL

1 Introduction

The Web of data has recently started being populated with geospatial data, and geospatial extensions of SPARQL, like GeoSPARQL and stSPARQL, have been defined. GeoSPARQL [12] is a recently proposed OGC standard for a SPARQL-based query language for geospatial data expressed in RDF. GeoSPARQL defines a vocabulary (classes, properties, and extension functions) that can be used in RDF graphs and SPARQL queries to represent and query geographic features with vector geometries. stSPARQL [9,1] is an extension of SPARQL 1.1 developed by our group for representing and querying geospatial data that change over time. Similarly to GeoSPARQL, the geospatial part of stSPARQL defines datatypes that can be used for representing in RDF the serializations of vector geometries encoded according to the widely adopted OGC standards Well Known Text (WKT) and Geography Markup Language (GML). stSPARQL and GeoSPARQL define extension functions from the OGC standard OpenGIS Simple Feature Access (OGC-SFA) that can be used for manipulating vector geometries.
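To illustrate what a geometry serialized as WKT looks like once it has been read from an RDF literal, the following Python sketch parses a few simple WKT strings and computes their envelopes (bounding boxes). It is only a sketch under stated assumptions: the helper names parse_wkt, envelope and envelopes_intersect are hypothetical, only POINT, LINESTRING and single-ring POLYGON values are handled, and an optional CRS prefix in the literal is not supported. An envelope test of this kind is a coarse filter; it is not a full implementation of an OGC-SFA topological function.

```python
import re

# Matches the three simple WKT geometry kinds handled by this sketch.
_WKT_PATTERN = re.compile(
    r"^\s*(POINT|LINESTRING|POLYGON)\s*\((.*)\)\s*$",
    re.IGNORECASE | re.DOTALL,
)


def parse_wkt(text):
    """Return (kind, [(x, y), ...]) for a POINT, LINESTRING or POLYGON."""
    match = _WKT_PATTERN.match(text)
    if match is None:
        raise ValueError("unsupported WKT: %r" % text)
    kind, body = match.group(1).upper(), match.group(2)
    if kind == "POLYGON":
        # Keep only the outer ring; holes are ignored in this sketch.
        ring = re.match(r"\s*\(([^()]*)\)", body)
        if ring is None:
            raise ValueError("malformed POLYGON: %r" % text)
        body = ring.group(1)
    coords = [tuple(float(v) for v in pair.split()) for pair in body.split(",")]
    return kind, coords


def envelope(coords):
    """Bounding box (min_x, min_y, max_x, max_y) of a coordinate list."""
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def envelopes_intersect(a, b):
    """True if two bounding boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
```

For example, the envelope of "POLYGON((0 0, 4 0, 4 3, 0 3, 0 0))" intersects the envelope of "POINT(2 1)" but not that of "POINT(5 5)".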
In parallel with the appearance of GeoSPARQL and stSPARQL, researchers have implemented geospatial RDF stores that support these SPARQL extensions (our own system Strabon, Parliament and uSeekM). Typically, this has been done by extending existing RDF stores that had no geospatial functionality (e.g., Sesame) and by relying on state-of-the-art spatially enabled RDBMS (e.g., PostGIS) for the storage and querying of geometries. One reason that this approach has been successful is that the relational realization of the OGC-SFA standard has been widely adopted by many RDBMS for storing and manipulating vector geometries. The state of the art in this area is summarized in the survey paper [8].

This work was supported in part by the European Commission project TELEIOS (257662).

The above advances to the state of the art in query languages and implemented systems have not so far been matched by much work on the evaluation and benchmarking of implemented geospatial RDF stores. Although there are various benchmarks for spatially enabled RDBMS [17,13,4,14,15,11], there is only one paper in the literature that proposes a benchmark for geospatial data expressed in RDF [6]. However, since this work preceded the proposal of GeoSPARQL and stSPARQL, it does not cover many of the features available in these languages. For example, only point and rectangle geometries are used in the data and only two topological functions and two non-topological functions are considered, while metric spatial functions and spatial aggregates are not discussed. Similarly, only the geospatial RDF store SPAUK, which is a precursor to Parliament, has been evaluated using the benchmark. Finally, [6] uses a synthetic workload only and does not consider any linked geospatial datasets such as the ones that are available in the LOD cloud today.

In this paper we go significantly beyond [6] and develop a benchmark that can be used for the evaluation of the new generation of RDF stores supporting the query languages GeoSPARQL and stSPARQL. Our benchmark, nicknamed Geographica, is composed of two workloads with their associated datasets and queries: a real-world workload based on publicly available linked datasets and a synthetic workload.
(Geographica (Greek: Γεωγραφικά) is a 17-volume encyclopedia of geographical knowledge written by the Greek geographer, philosopher and historian Strabon (Greek: Στράβων) in 7 BC.)

The real-world workload uses publicly available linked geospatial data covering a wide range of geometry types (e.g., points, lines, polygons). To define this workload, we follow the approach of the benchmark Jackpine [15] and define a micro benchmark and a macro benchmark. The micro benchmark tests primitive spatial functions. We check the spatial component of a system with queries that use non-topological functions, spatial selections, spatial joins and spatial aggregate functions. In the macro benchmark we test the performance of the selected RDF stores in typical application scenarios like reverse geocoding, map search and browsing, and a real-world use case from the Earth Observation domain.

In the second workload of Geographica we use a generator that produces synthetic datasets of various sizes and generates queries of varying thematic and spatial selectivity. In this way, we can evaluate geospatial RDF stores in a controlled environment. In this part we follow the rationale of earlier papers [14,9,3]. For reasons of reproducibility, both workloads are publicly available.
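The notion of queries with varying thematic and spatial selectivity can be sketched as follows. This is not the Geographica generator; it is a toy illustration under explicit assumptions: features are random points in the unit square, each carrying an integer tag, and the function names make_features, query_box and select are hypothetical. A query square that covers a fraction s of the unit square selects roughly a fraction s of the points (spatial selectivity), and keeping only tags divisible by k selects roughly 1/k of them (thematic selectivity).

```python
import math
import random


def make_features(n, seed=0):
    """Return n synthetic point features (x, y, tag) in the unit square."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random(), i) for i in range(n)]


def query_box(spatial_selectivity):
    """Square anchored at the origin covering the given fraction of the unit square."""
    side = math.sqrt(spatial_selectivity)
    return (0.0, 0.0, side, side)


def select(features, box, tag_modulus=1):
    """Features inside the box whose tag is divisible by tag_modulus."""
    min_x, min_y, max_x, max_y = box
    return [
        f
        for f in features
        if min_x <= f[0] <= max_x
        and min_y <= f[1] <= max_y
        and f[2] % tag_modulus == 0
    ]
```

With such a setup, the expected result size of a query is known in advance, which is what makes a synthetic workload a controlled environment: the measured response time can be related directly to the fraction of the data that the query is designed to touch.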

We chose to test the systems Strabon, Parliament and uSeekM. To the best of our knowledge, these systems are the only ones that currently provide support for a rich subset of GeoSPARQL and stSPARQL. Other RDF stores, like OpenLink Virtuoso, OWLIM and AllegroGraph, allow only the representation of point geometries and provide support for only a few geospatial functions [8]. The limited functionality provided by these systems did not allow us to include them in the experiments presented in this paper. A comparison between generic RDF stores with limited geospatial capabilities and geospatial RDF stores is given in the long version of this paper.

The rest of the paper is organized as follows. Section 2 presents previous related work. The benchmark and its results are described in Sections 3 and 4, respectively, and general conclusions and future work are discussed in Section 5.

2 Related Work

In this section we discuss the most important benchmarks that are relevant to Geographica. We first present well-known benchmarks for SPARQL query processing, then benchmarks from the area of spatial relational databases and, finally, the only available benchmark for querying linked geospatial data.

Benchmarks for SPARQL query processing. Four well-known benchmarks for SPARQL querying are the Lehigh University Benchmark (LUBM) [5], the Berlin SPARQL Benchmark (BSBM) [2], the SP2Bench SPARQL Performance Benchmark [16] and the DBpedia SPARQL Benchmark (DBPSB) [10]. LUBM, BSBM and SP2Bench create a synthetic dataset based on a use case scenario and define some queries covering a spectrum of SPARQL characteristics. For example, the synthetic dataset of SP2Bench resembles the original publications dataset of DBLP, while the dataset of LUBM describes the university domain. The creators of DBPSB take a different approach. They propose a benchmark creation methodology based on real-world data and query logs.
The proposed methodology is used in [10] to create a benchmark based on DBpedia data and query logs.

Benchmarks for spatial relational databases. One of the first benchmarks for spatial relational databases has been the SEQUOIA benchmark [17], which focuses on Earth Science use cases. In order for its results to be representative of Earth Science use cases, SEQUOIA uses real-world data (satellite raster data, point locations of geographic features, land use/land cover polygons and data about drainage networks covering the area of the USA) and real-world queries. Its queries cover tasks like data loading, raster data management, filtering based on spatial and non-spatial criteria, spatial joins, and path computations over graphs. The SEQUOIA benchmark has been extended in [13] to evaluate the geospatial DBMS Paradise. Two other well-known benchmarks for spatial relational databases which use synthetic vector data are À La Carte [4] and VESPA [14]. À La Carte uses a dataset consisting only of rectangles which are generated according to various statistical distributions, and it has been used to compare the performance of different spatial join techniques. VESPA [14] creates a more complex dataset with more geometry types (polygons, lines and points), and it has been used to compare PostgreSQL with the Rock & Roll deductive object-oriented database. More recently, [15] has defined a more generic benchmark for spatial relational databases, called Jackpine. It includes two kinds of benchmarking, micro and macro. Micro benchmarking tests topological predicates and spatial analysis functions in isolation. Macro benchmarking defines six typical spatial data application scenarios and tests a number of queries based on them.

Benchmarks for geospatial RDF stores. The only published benchmark for querying geospatial data encoded in RDF has been proposed by Kolas [6]. He extends LUBM to include spatial entities and to test the functionality of spatially enabled RDF stores. LUBM queries are extended to cover four primary types of spatial queries, namely spatial location queries, spatial range queries, spatial join queries and nearest neighbor queries. Range queries aim to test cases of various selectivity, while spatial joins aim to test whether the query planner selects a good plan by taking into account the selectivity of the spatial and ontological part of each query.

3 The Benchmark Geographica

In this section we present our benchmark in detail. Section 3.1 presents its first part (the real-world workload) while Section 3.2 presents the second part (the synthetic workload).

3.1 Real-World Workload

This workload aims at evaluating the efficiency of basic spatial functions that a geospatial RDF store should offer. In addition, this workload includes three typical application scenarios.

Table 1: Dataset characteristics

Dataset | Size | Triples | # of Points | # of Lines | # of Polygons
GAG | 33MB | 4K
CLC | 401MB | 630K | K
LGD (only ways) | 29MB | 150K | - | 12K | -
GeoNames | 45MB | 400K | 22K | - | -
DBpedia | 89MB | 430K | 8K | - | -
Hotspots | 90MB | 450K | K

Datasets.
In this section we describe the datasets that we use for the real-world workload. We include datasets that play an important role in the Linked Open Data Cloud, such as the parts of DBpedia and GeoNames referring to Greece, despite the fact that their spatial information is limited to points. In addition, we include part of the LinkedGeoData (LGD) dataset, which has richer geospatial information from OpenStreetMap about the road network and rivers of Greece. We also chose to use the Greek Administrative Geography (GAG) and the CORINE Land Use/Land Cover (CLC) dataset for Greece, which have complex polygons. The CLC dataset is made available by the European Environment Agency for the whole of Europe and contains data regarding the land cover of European countries. Both of these datasets with information about Greece have been published as linked data by us in the context of the European project TELEIOS. Finally, we include a dataset containing polygons that represent wild fire hotspots. This dataset has been produced by the National Observatory of Athens (NOA) in the context of project TELEIOS by processing appropriate satellite images as described in [7]. Each dataset is loaded in a separate named graph so that each query accesses only the part of the data that it needs. Some important characteristics of the datasets used can be found in Table 1.

Micro Benchmark. The micro benchmark aims at testing the efficiency of primitive spatial functions in state-of-the-art geospatial RDF stores. Thus, we use simple SPARQL queries which consist of one or two triple patterns and a spatial function. We start by checking simple spatial selections. Next, we test more complex operations such as spatial joins. We test spatial joins using the topological relations defined by stSPARQL [9] and the Geometry Topology component of GeoSPARQL. Apart from topological relations, we test non-topological functions (e.g., geof:buffer), defined by the Geometry extension of GeoSPARQL, which construct a new geometry object. Additionally, we test the metric function strdf:area, which is only defined in stSPARQL. The aggregate functions strdf:extent and strdf:union of stSPARQL are also tested by this benchmark. GeoSPARQL does not define aggregate functions.
We include aggregate functions in Geographica since they are present in all geospatial RDBMS, and we found them very useful in EO applications in the context of the project TELEIOS. A short description of the queries used in the micro benchmark can be found in Table 2.

Macro Benchmark. In the macro benchmark we aim to test the performance of the selected RDF stores in the following typical application scenarios: reverse geocoding, map search and browsing, and two scenarios from the Earth Observation domain.

Reverse Geocoding. Reverse geocoding is the process of attributing a readable address or place name to a given point. Thus, in this scenario, we pose SPARQL queries which sort retrieved objects by their distance to the given point and select the first one.

Map Search and Browsing. This scenario demonstrates the queries that are typically used in Web-based mapping applications. A user first searches for points of interest based on thematic criteria. Then, she selects a specific point and information about the area around it is retrieved (e.g., POIs and roads).

Rapid Mapping for Wild Fire Monitoring. In this scenario we test queries which retrieve map layers for creating a map that can be used by decision makers tasked with the monitoring of wild fires. This application has been studied in detail in project TELEIOS [7] and the scenario covers its core querying needs. First, spatial selections are used to retrieve basic information of interest (e.g., roads, administrative areas etc.). Second, more complex information can be derived using spatial joins and non-topological functions. For example, a user may be interested in the segments of roads that may be damaged by fire. We point out that this scenario is representative of many rapid mapping tasks encountered in Earth Observation applications. The queries of the macro benchmark can be found in Table 3.

Table 2: Queries of the micro benchmark
Query  Operation  Description
Non-topological construct functions
Q1  Boundary  Construct the boundary of all polygons of CLC
Q2  Envelope  Construct the envelope of all polygons of CLC
Q3  Convex Hull  Construct the convex hull of all polygons of CLC
Q4  Buffer  Construct the buffer of all points of GeoNames
Q5  Buffer  Construct the buffer of all lines of LGD
Q6  Area  Compute the area of all polygons of CLC
Spatial selections
Q7  Equals  Find all lines of LGD that are spatially equal to a given line
Q8  Equals  Find all polygons of GAG that are spatially equal to a given polygon
Q9  Intersects  Find all lines of LGD that spatially intersect a given polygon
Q10  Intersects  Find all polygons of GAG that spatially intersect a given line
Q11  Overlaps  Find all polygons of GAG that spatially overlap a given polygon
Q12  Crosses  Find all lines of LGD that spatially cross a given line
Q13  Within polygon  Find all points of GeoNames that are contained in a given polygon
Q14  Within buffer  Find all points of GeoNames that are contained in the buffer of a given point
Q15  Near a point  Find all points of GeoNames that are within a specific distance from a given point
Q16  Disjoint  Find all points of GeoNames that are spatially disjoint from a given polygon
Q17  Disjoint  Find all lines of LGD that are spatially disjoint from a given polygon
Spatial joins
Q18  Equals  Find all points of GeoNames that are spatially equal to a point of DBpedia
Q19  Intersects  Find all points of GeoNames that spatially intersect a line of LGD
Q20  Intersects  Find all points of GeoNames that spatially intersect a polygon of GAG
Q21  Intersects  Find all lines of LGD that spatially intersect a polygon of GAG
Q22  Within  Find all points of GeoNames that are within a polygon of GAG
Q23  Within  Find all lines of LGD that are within a polygon of GAG
Q24  Within  Find all polygons of CLC that are within a polygon of GAG
Q25  Crosses  Find all lines of LGD that spatially cross a polygon of GAG
Q26  Touches  Find all polygons of GAG that spatially touch other polygons of GAG
Q27  Overlaps  Find all polygons of CLC that spatially overlap polygons of GAG
Aggregate functions
Q28  Extension  Construct the extension of all polygons of GAG
Q29  Union  Construct the union of all polygons of GAG

Table 3: Queries of the macro benchmark
Query  Description
Reverse Geocoding
RG1  Find the closest populated place (from GeoNames)
RG2  Find the closest street (from LGD)
Map Search and Browsing
MSB1  Find the co-ordinates of a given POI based on thematic criteria (from GeoNames)
MSB2  Find roads in a given bounding box around these co-ordinates (from LGD)
MSB3  Find other POI in a given bounding box around these co-ordinates (from GeoNames)
Rapid Mapping for Fire Monitoring
RM1  Find the land cover of areas inside a given bounding box (from CLC)
RM2  Find primary roads inside a given bounding box (from LGD)
RM3  Find detected hotspots inside a given bounding box (from Hotspots)
RM4  Find municipality boundaries inside a given bounding box (from GAG)
RM5  Find coniferous forests inside a given bounding box which are on fire (from CLC and Hotspots)
RM6  Find road segments inside a given bounding box which may be damaged by fire (from LGD and Hotspots)

3.2 Synthetic Workload

The synthetic workload of Geographica relies on a generator that produces synthetic datasets of various sizes and instantiates query templates that can produce queries with varying thematic and spatial selectivity. In this way, we can evaluate geospatial RDF stores in a controlled environment and monitor their performance precisely.

Datasets. The workload generator produces synthetic datasets of arbitrary size that resemble features on a map. As in VESPA [14], the produced datasets model the following geographic features: states in a country, land ownership, roads and points of interest. For each dataset, we developed a minimal ontology that follows a general version of the schema of OpenStreetMap and uses GeoSPARQL ontologies and vocabularies. In Figure 1(a) we present the developed ontology for representing points of interest only.
As in [3,9], every feature (i.e., point of interest) is assigned a number of thematic tags, each of which consists of a key-value pair of strings. Each feature is tagged with key 1, every other feature with key 2, every fourth feature with key 4, etc. up to key $2^k$, $k \in \mathbb{N}$. This tagging makes it possible to select different parts of the entire dataset in a uniform way, and to perform queries of various thematic selectivities. For example, if we selected all points of interest tagged with key 1, we would select all available points of interest; if we selected all points of interest tagged with key 2, we would select half of them, etc. Every feature has a spatial extent as well, which is modelled using the GeoSPARQL vocabulary. The spatial extent of the land ownership dataset constitutes a uniform grid of $n \times n$ hexagons. The land ownership dataset forms the basis for the spatial extent of all generated datasets, since the size of each dataset is given relative to the number $n$. By modifying the number of hexagons along an axis, we produce datasets of arbitrary size. As we will see in the following section, this enabled us to adjust the selectivity of the spatial predicates appearing in queries in a uniform way too.

Fig. 1: Synthetic dataset. (a) Ontology for Points of Interest; (b) Visualization of the geometric part of the synthetic dataset.

As in [14], the generated land ownership dataset consists of $n^2$ features with hexagonal spatial extent, where each hexagon is placed uniformly on an $n \times n$ grid. The cardinality of the land ownerships is $n^2$. The generated state dataset consists of $(n/3)^2$ features with hexagonal spatial extent, where each hexagon is placed uniformly on an $n/3 \times n/3$ grid. The cardinality of the state geometries is $(n/3)^2$. The generated road dataset consists of $n$ features with sloping line geometries. Half of the line geometries are roughly horizontal and the other half are roughly vertical. Each line consists of $n$ line segments. The cardinality of the road geometries is $n$. The generated point of interest dataset consists of $n^2$ features with point geometries which are uniformly placed on $n$ sloping, evenly spaced, parallel lines. The cardinality of the point of interest geometries is $n^2$. In Figure 1(b) we present a sample of the generated geometries.

Queries. The synthetic workload generator produces SPARQL queries corresponding to spatial selections and spatial joins by instantiating the two query templates presented in Table 4. The query template used for producing SPARQL queries corresponding to spatial selections is identical to the query template used in [3,9]. In this query template, parameter THEMA is one of the values used when assigning tags to a feature and parameter GEOM is the WKT serialization of a rectangle. As in [9], we define the thematic selectivity of an instantiation of the query template as the fraction of the total features of a dataset that are tagged with a key equal to THEMA.
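The cardinalities and the tagging scheme described above can be made concrete with a small sketch. It is not the Geographica generator; the function names are hypothetical, features are assumed to be numbered from 0, and n/3 is assumed to be rounded down (which reproduces the 28,900 states reported for n = 512 in Section 4.3).

```python
def cardinalities(n: int) -> dict:
    """Feature counts of the four synthetic datasets for grid parameter n."""
    return {
        "landownership": n * n,
        "state": (n // 3) ** 2,  # assumption: n/3 is rounded down
        "road": n,
        "pointofinterest": n * n,
    }


def tags_for_feature(index: int, k: int) -> list:
    """Keys assigned to the feature at 0-based position `index`.

    Key 2**j is given to every (2**j)-th feature, for j = 0..k
    (assumption: counting starts at the first feature).
    """
    return [2 ** j for j in range(k + 1) if index % (2 ** j) == 0]


def thematic_selectivity(thema: int, total: int, k: int) -> float:
    """Fraction of `total` features that carry the key `thema`."""
    hits = sum(1 for i in range(total) if thema in tags_for_feature(i, k))
    return hits / total
```

With k = 9 the largest key is 512, so a query on that key touches one feature in 512, while a query on key 1 touches every feature.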
For example, by altering the value of THEMA from 1 to 2, we reduce the thematic selectivity of the query by selecting half the nodes we previously did. We define the spatial selectivity of an instantiation of the query template as the fraction of the total features for which the topological relation defined by parameter FUNCTION holds between each of them and the rectangle defined by parameter GEOM. By modifying the value of the parameter namespace ns we specify the dataset and the corresponding type of geometric information that is examined by an instance of the query template.

Table 4: Query templates for generating SPARQL queries corresponding to (a) spatial selections, and (b) spatial joins.
(a) SELECT ?s WHERE { ?s ns:hasgeometry/ns:aswkt ?g . ?s c:hastag/ns:haskey "THEMA" . FILTER(FUNCTION(?g, "GEOM")) }
(b) SELECT ?s1 ?s2 WHERE { ?s1 ns1:hasgeometry/ns1:aswkt ?g1 . ?s1 ns1:hastag/ns1:haskey "THEMA" . ?s2 ns2:hasgeometry/ns2:aswkt ?g2 . ?s2 ns2:hastag/ns2:haskey "THEMA'" . FILTER(FUNCTION(?g1, ?g2)) }

The query template used for producing SPARQL queries corresponding to spatial joins involves two datasets identified by the values of the parameter namespaces ns1 and ns2. In this query template as well, parameters THEMA and THEMA' control the thematic selectivity of the query. The value of parameter FUNCTION defines the topological relation that must hold between instances of the two datasets that are involved in an instance of the query template. Parameter FUNCTION can be instantiated with every function defined in the Geometry Topology extension of GeoSPARQL. In our experiments, as described in Section 4.3, we used geof:sfIntersects, geof:sfTouches and geof:sfWithin. For example, by instantiating the query template (b) with the values poi for ns1, state for ns2, 1 for THEMA, 2 for THEMA' and geof:sfWithin for FUNCTION, we get a SPARQL query that asks for all generated points of interest that are inside half of the generated states.

These query templates allow us to generate SPARQL queries with great diversity regarding their spatial and thematic selectivity, thus stressing the optimizers of the geospatial RDF stores that we test and evaluating their effectiveness in identifying efficient query plans.
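As an illustration of how the spatial join template might be instantiated, the following minimal sketch fills in template (b) using Python's standard string.Template. The template text follows Table 4 as printed; the helper name instantiate_join and the placeholder names thema1 and thema2 (standing for THEMA and THEMA') are assumptions made for this example, and PREFIX declarations are omitted.

```python
from string import Template

# Template (b) of Table 4, with ${...} placeholders for its parameters.
JOIN_TEMPLATE = Template(
    "SELECT ?s1 ?s2 WHERE {\n"
    "  ?s1 ${ns1}:hasgeometry/${ns1}:aswkt ?g1 .\n"
    "  ?s1 ${ns1}:hastag/${ns1}:haskey \"${thema1}\" .\n"
    "  ?s2 ${ns2}:hasgeometry/${ns2}:aswkt ?g2 .\n"
    "  ?s2 ${ns2}:hastag/${ns2}:haskey \"${thema2}\" .\n"
    "  FILTER(${function}(?g1, ?g2))\n"
    "}"
)


def instantiate_join(ns1, ns2, thema1, thema2, function):
    """Return the text of one spatial join query (hypothetical helper)."""
    return JOIN_TEMPLATE.substitute(
        ns1=ns1, ns2=ns2, thema1=thema1, thema2=thema2, function=function
    )


# The example from the text: points of interest inside half of the states.
query = instantiate_join("poi", "state", 1, 2, "geof:sfWithin")
print(query)
```

Varying thema1 and thema2 over the keys 1, 2, 4, ... changes only the thematic selectivity, while the choice of function and datasets determines the spatial part of the query.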
4 Benchmark Results

In this section we present the results of running Geographica against the open source systems Strabon, Parliament and uSeekM, which currently provide support for a rich subset of GeoSPARQL and stSPARQL. A comparison between these geospatial RDF stores and generic RDF stores that provide support only for point geometries is given in the long version of this paper.

4.1 Experimental Setup

In this section we describe the setup of the experiments used to evaluate the selected triple stores. The machine that was used to run the benchmark is equipped with two Intel Xeon E5620 processors with 12 MB L3 cache running at 2.4 GHz, 24 GB of RAM and a RAID-5 disk array that consists of four disks. Each disk has 32 MB of cache and its rotational speed is 7200 rpm.

Table 5: Storing times
Workload  Strabon  uSeekM  Parliament
Real-world  220 sec.  214 sec.  250 sec.
Synthetic  221 sec.  406 sec.  462 sec.

Table 6: Average Iteration times - Macro Scenarios
Scenario  Strabon  uSeekM  Parliament
Reverse Geocoding  65s  0.77s  2.6s
Map Search and Browsing  0.9s  0.6s  22.2s
Rapid Mapping for Fire Monitoring  207.4s  -  -

Each query in the micro and the synthetic benchmark was run three times on cold and warm caches. For warm caches, we ran each query once before measuring the response time, in order to warm up the caches. We measured the response time of each query as the elapsed time from submitting the query until a complete iteration over the results had been completed, and we report the median of all measurements. A timeout of one hour was set as the time limit for all queries. For the macro benchmark, we ran each scenario many times (with a different initialization each time) for one hour without cleaning the caches, and we report the average time for a complete execution of all queries defined in each scenario.

Strabon and uSeekM utilize Postgres enhanced with PostGIS as a spatially-enabled relational backend. For these systems, we set up an instance of Postgres 9.2 with PostGIS 2.0 and we tuned it to make better use of the system resources. For every dataset of Geographica, a unique property is used to connect geometries with their serialization (e.g., the CORINE Land Use/Land Cover ontology defines the property clc:aswkt), and this property is defined as a subproperty of the property geo:asWKT that is defined by GeoSPARQL. Parliament is able to identify and index a triple that represents the serialization of a geometric object only when the property geo:asWKT is used. As a result, the RDFS reasoning capabilities of Parliament have to be enabled so that it performs forward chaining during data loading and indexes the geometry using the spatial index as well.
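The effect of forward chaining over subproperty declarations can be illustrated with a short sketch. This is not Parliament's reasoner; it is a minimal, assumed implementation of the RDFS subproperty rule over in-memory tuples, and the subject name clc:area1 and the WKT value are made up for the example.

```python
def forward_chain_subproperties(triples, sub_property_of):
    """Materialise (s, q, o) for every (s, p, o) where p is a subproperty of q.

    `triples` is an iterable of (subject, predicate, object) tuples and
    `sub_property_of` maps a property to its direct super-properties.
    Chains of subproperties are followed until nothing new is derived.
    """
    inferred = set(triples)
    frontier = list(inferred)
    while frontier:
        s, p, o = frontier.pop()
        for parent in sub_property_of.get(p, ()):
            derived = (s, parent, o)
            if derived not in inferred:
                inferred.add(derived)
                frontier.append(derived)
    return inferred


# Hypothetical data: one CLC geometry stated with the dataset-specific property.
wkt = "POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))"
data = {("clc:area1", "clc:aswkt", wkt)}
hierarchy = {"clc:aswkt": ["geo:asWKT"]}
closed = forward_chain_subproperties(data, hierarchy)
```

After the closure, the geometry is also reachable through geo:asWKT, which is the property a store that indexes only that predicate would look for.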
Strabon and uSeekM do not perform any reasoning on the input data.

4.2 Real-World Workload

Dataset Storage. In this section we present the time required by each system to store and index the datasets of the real-world workload. Strabon uses a storing scheme which creates a predicate table for every unique predicate of the input dataset. Usually, this choice leads to the creation of a large number of predicate tables, and consequently a lot of time is required for storing and indexing. The bulk loader of Strabon emulates this per-predicate scheme, but it merges into a single table the predicates that are rarely used in a dataset, and so it reduces the required storing time. uSeekM needs less time since it is based on the native repository of Sesame, which is known to be the most efficient implementation of Sesame for average-sized datasets. Parliament is slower than uSeekM, as it requires more time to perform forward chaining on the input dataset, as described in Section 4.1.

Micro Benchmark. The results of the micro benchmark are shown in Table 7, where the response time of each query is reported for both cold and warm caches.
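The way a single cold-cache or warm-cache figure is obtained, as described in Section 4.1, can be sketched as follows. This is an assumed harness, not the one used for the experiments: run_query is a hypothetical callable that submits the query and returns an iterable of results, clear_caches is a hypothetical hook, and the timeout is only checked after a run finishes rather than enforced.

```python
import statistics
import time

TIMEOUT_SECONDS = 3600  # one-hour limit stated in Section 4.1


def time_query(run_query, warm=False, repetitions=3,
               timeout=TIMEOUT_SECONDS, clear_caches=lambda: None):
    """Median elapsed time over `repetitions` runs, or None on timeout."""
    if warm:
        # One unmeasured run to warm up the caches.
        for _row in run_query():
            pass
    samples = []
    for _ in range(repetitions):
        if not warm:
            clear_caches()  # assumption: caches are dropped before each cold run
        start = time.perf_counter()
        for _row in run_query():  # iterate over the complete result set
            pass
        elapsed = time.perf_counter() - start
        if elapsed > timeout:
            return None  # a real harness would abort the running query instead
        samples.append(elapsed)
    return statistics.median(samples)
```

Measuring until the result iteration is complete, rather than until the first row arrives, matches the definition of response time given in Section 4.1.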

First, the results of evaluating the queries with non-topological functions are reported. For this class of queries uSeekM and Strabon have comparable response times, with uSeekM being the fastest system. We observed that uSeekM does not utilize Postgres for evaluating these queries, but chooses to evaluate them using the native store of Sesame, which is known to be more efficient, for small datasets, than Sesame implementations on top of a DBMS, like Strabon. Computing the area of polygons (Query 6) was tested only in uSeekM and Strabon since Parliament does not offer such functionality. We observe that none of the RDF stores benefits much from warm caches when evaluating non-topological functions. This is because the non-topological functions used in this set of queries are computationally intensive (especially when complex geometries are used) and the time spent in the CPU dominates I/O time.

In the case of spatial selections, Strabon and uSeekM have comparable response times, with Strabon being the fastest system most of the time. Both systems choose to start the query evaluation process by evaluating the spatial part of a query in PostGIS using the spatial index that is available. uSeekM continues by evaluating the rest of the query using the native store of Sesame. This adds a small overhead compared to Strabon, which evaluates the whole query in PostgreSQL and utilizes a unified dictionary encoding scheme for both thematic and spatial information. In contrast, the optimizer of Parliament does not take into consideration filters containing GeoSPARQL functions, so it evaluates the spatial predicate exhaustively over the results of the thematic part of the query. Queries 14 and 15 are semantically equivalent. Both ask for points that lie within a given distance from a given point.
However, Query 14 creates the buffer of a given point with radius r and asks for points which are within this buffer, while Query 15 asks for points that have distance less than r from the given point. uSeekM and Parliament evaluate both queries by starting with the thematic part of the query and then evaluating the spatial operations exhaustively, without using the spatial index. The difference in the response time of queries 14 and 15 for these systems is due to the fact that calculating the distance between two points is much cheaper than evaluating the corresponding point-in-polygon operation. Strabon follows a similar plan for evaluating Query 15. However, for Query 14, Strabon calculates the buffer of the given point and uses it to probe the spatial index for discovering points that lie inside the constructed polygon. This choice is correct, since the response time remains constant.

In the case of spatial joins, uSeekM and Parliament are able to evaluate only queries 18 and 27 given the time limit of one hour. Parliament does not take into account GeoSPARQL extension functions during the optimization phase, resulting in query plans that evaluate separately the graph patterns corresponding to different graphs, compute the Cartesian product between them, and then apply the spatial predicate to the result of the Cartesian product. This strategy is very costly, thus Parliament is not able to answer most spatial joins given the time limit. uSeekM does not utilize PostGIS for evaluating spatial joins. Similarly to Parliament, it applies the spatial predicate to the result of the Cartesian product of the graph patterns. Strabon avoids evaluating Cartesian products by identifying graph patterns that are related only through the spatial predicate and pushing the evaluation of the spatial join to PostGIS, thus resulting in good response times. In all cases, warm caches do not affect the response time of the queries since a large number of intermediate results is produced.

Table 7: Response times - Real Workload. Rows: queries Q1 to Q29 grouped by type (non-topological construct functions, spatial selections, spatial joins, aggregate functions). Columns: response time in seconds on cold caches and on warm caches for Strabon, uSeekM and Parliament.

Finally, spatial aggregations are tested only in Strabon since it is the only system that supports such functions. We notice that Query 28, which computes the minimum bounding box that contains all geometries of the GAG dataset, is much faster than Query 29, which computes the union of the same geometries, since the former operation is much cheaper than the latter.

Macro Benchmark. The results of the macro benchmark are shown in Table 6. In this table we report the average time needed for a complete iteration of all the queries of each scenario. The Reverse Geocoding scenario has two queries which use the function distance with a fixed limit. uSeekM performs best in this scenario, while Strabon needs an order of magnitude more time. The Map Search and Browsing scenario has one thematic query and two spatial selection queries. As described in Section 4.2, Strabon and uSeekM are efficient in evaluating spatial selections and they have good performance in this scenario as well. Finally, the Rapid Mapping for Fire Monitoring scenario is the most demanding scenario.
It comprises three spatial selection queries, but also two complex queries which include spatial joins and construct new geometries (boundary and intersection). Only Strabon can serve this scenario, since uSeekM and Parliament needed more than an hour to evaluate the query RM6. This happens because the query RM6 requires evaluating a demanding spatial join, which is evaluated in a costly way by Parliament and uSeekM as described in the previous paragraphs.
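The difference between applying a spatial predicate to a Cartesian product and probing a spatial index can be illustrated with a simplified sketch. It works on axis-aligned bounding boxes given as (minx, miny, maxx, maxy) tuples and uses a uniform grid as a stand-in for a real spatial index; it does not reproduce the query plans or index structures of any of the tested systems, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def boxes_intersect(a, b):
    """True if two closed boxes (minx, miny, maxx, maxy) share a point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def cartesian_join(left, right):
    """Test the predicate on every pair: len(left) * len(right) evaluations."""
    return {
        (i, j)
        for (i, a), (j, b) in product(enumerate(left), enumerate(right))
        if boxes_intersect(a, b)
    }


def grid_join(left, right, cell=1.0):
    """Probe a uniform grid built on `right`; only nearby boxes are tested."""
    def cells(box):
        x0, y0 = int(box[0] // cell), int(box[1] // cell)
        x1, y1 = int(box[2] // cell), int(box[3] // cell)
        return product(range(x0, x1 + 1), range(y0, y1 + 1))

    index = defaultdict(set)
    for j, b in enumerate(right):
        for c in cells(b):
            index[c].add(j)

    result = set()
    for i, a in enumerate(left):
        candidates = set()
        for c in cells(a):
            candidates |= index.get(c, set())
        result |= {(i, j) for j in candidates if boxes_intersect(a, right[j])}
    return result
```

Both functions return the same pairs, but the grid variant evaluates the predicate only for boxes that share a grid cell, so its cost depends on the number of nearby candidates rather than on the product of the two input sizes.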

4.3 Synthetic Workload

Let us now discuss representative experiments that we ran using a synthetic workload produced by the generator presented in Section 3. We generated a dataset by setting n = 512 and k = 9, where n is the number used for defining the cardinalities of the generated geometries, and k is the number used for defining the cardinalities of the generated tag values. This instantiation of the synthetic generator produces 262,144 land ownership instances, 28,900 states, 512 roads and 262,144 points of interest. Each feature is tagged with key 1, every other feature with key 2, etc. up to key 512. The resulting dataset consists of 3,880,224 triples and its size is 745 MB.

Dataset Storage. Table 5 presents the time required by each system to store and index the synthetic dataset. The synthetic dataset has fewer predicates and more geometries than the real one. uSeekM requires more time than Strabon for storing the dataset, since it stores it in a Sesame native store and then stores the triples with geometric information in PostGIS as well. This overhead is significant compared to the total time required for storing the dataset, but leads to better response times in some cases. As we have already explained in Section 4.2, Parliament needs more time to store the synthetic dataset as well as the real-world dataset because it performs forward chaining on the input data.

Queries. We instantiated the query template presented in Table 4(a) in order to produce SPARQL queries corresponding to spatial selections that ask for land ownerships that intersect a given rectangle, and points of interest that are within a given rectangle. The given rectangle is generated in such a way that the spatial predicate of the query holds for 1‰, 10%, 25%, 50%, 75% or all the features of the respective dataset.
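One way to obtain a rectangle with a prescribed spatial selectivity can be sketched as follows. The layout is a deliberate simplification: points are assumed to sit on an m x m integer grid rather than on the sloping lines of the actual generator, and all function names are hypothetical.

```python
def grid_points(m):
    """Points of a simplified m x m layout at integer coordinates."""
    return [(x, y) for x in range(m) for y in range(m)]


def rectangle_for_fraction(m, fraction):
    """Rectangle covering round(fraction * m) full columns of the grid."""
    columns = max(1, round(fraction * m))
    # Half-unit margins keep the selected grid points strictly inside.
    return (-0.5, -0.5, columns - 0.5, m - 0.5)


def to_wkt(rect):
    """WKT serialization of an axis-aligned rectangle, usable as GEOM."""
    minx, miny, maxx, maxy = rect
    return (
        f"POLYGON(({minx} {miny}, {maxx} {miny}, "
        f"{maxx} {maxy}, {minx} {maxy}, {minx} {miny}))"
    )


def spatial_selectivity(points, rect):
    """Fraction of points that lie strictly within the rectangle."""
    minx, miny, maxx, maxy = rect
    inside = sum(1 for x, y in points if minx < x < maxx and miny < y < maxy)
    return inside / len(points)
```

Because whole columns are selected, the achievable selectivities in this simplified layout are multiples of 1/m, so small targets such as 1‰ require a correspondingly large grid.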
In addition, we instantiated the query template using the extreme values 1 and 512 of the parameter THEMA for selecting either all or approximately 2‰ of the total features of a dataset. The response times of each system for evaluating the instantiations of this query template are presented in Figures 2(a)-2(h). We instantiated the query template presented in Table 4(b) in order to produce SPARQL queries corresponding to spatial joins that ask for land ownerships that intersect a state, touching states, and points of interest that are located inside a state. We also instantiated this query template using all combinations of the extreme values 1 and 512 for the parameters THEMA and THEMA'. The response times of each system for evaluating the instantiations of this query template are presented in Figures 2(i)-2(k).

By examining Figures 2(a)-2(h), we observe that Strabon has very good performance overall. Strabon pushes the evaluation of a SPARQL query to the underlying spatially-enabled DBMS, which in this case is Postgres enhanced with PostGIS. PostGIS has recently been enhanced with selectivity estimation capabilities. As a result, when a query selects only a few geometries, query evaluation always starts with the evaluation of the spatial predicate using the spatial index, thus resulting in few intermediate results and good response times. As the spatial selectivity increases and more geometries satisfy the spatial predicate, the optimizer of Postgres chooses different query plans. For example, when the value of the parameter THEMA is 1 (Figures 2(a), 2(c), 2(e), 2(g)) and the value of the parameter GEOM is such that all geometries satisfy the spatial predicate, Postgres ignores the spatial index and performs a sequential scan on the table storing the geometries for evaluating the spatial predicate. Similarly, when the value of the parameter THEMA is 512 (Figures 2(b), 2(d), 2(f), 2(h)) and the value of the parameter GEOM is such that all geometries satisfy the spatial predicate, Postgres starts with the evaluation of the thematic selection, which produces few intermediate results since only 2‰ of the features satisfy the thematic predicate, resulting in good query response times.

In the case of spatial joins (Figures 2(i)-2(k)), Strabon is the fastest system in most cases. The optimizer of Postgres takes into account the thematic selectivity of the queries and selects good query plans, thus Strabon is the only system that is able to answer the spatial joins within the one-hour timeout when the parameters THEMA and THEMA' are equal to 1.

Regarding uSeekM, we observe that its performance is not affected by the thematic selectivity of the query. For spatial selections, uSeekM always starts by evaluating the spatial predicate in PostGIS and then continues the query evaluation in the native Sesame store. As a result, regardless of the thematic selectivity, the response time of uSeekM increases as the number of features with geometries that satisfy the given spatial predicate increases. Regarding Parliament, we observe that its performance is affected neither by the thematic nor by the spatial selectivity of a query.
Parliament always starts by evaluating the non-spatial part of a query and then evaluates the thematic filter and the spatial predicate exhaustively on the intermediate results. Thus, the thematic and spatial selectivity of a query do not affect its response time.

In the case of spatial joins, uSeekM and Parliament produce the Cartesian product between the graph patterns that are joined through the spatial predicate, and evaluate the spatial predicate afterwards. This strategy is very costly, thus Parliament is not able to answer most spatial joins within the one-hour timeout and uSeekM is more than two orders of magnitude slower than Strabon. However, in Figure 2(j) we observe that uSeekM outperforms Strabon. Strabon stores all geometries in a single table, so the evaluation of the spatial predicate Touches on this table returns not only the geometries of states that touch each other, but the touching geometries of land ownerships as well. The touching geometries of land ownerships are discarded later on, but this overhead proves to be more costly than producing a Cartesian product and evaluating the spatial predicate afterwards.

5 Conclusions

We presented a benchmark for evaluating the performance of geospatial RDF stores. We defined two workloads that test, on the one hand, the performance of the spatial component of such systems in isolation and, on the other hand, whether spatial query processing is deeply integrated in their query engines. Future work concentrates on extending the benchmark to capture the complete GeoSPARQL standard, publishing larger real-world datasets and synthetic datasets that are not uniformly distributed, and evaluating them on the centralized and distributed geospatial RDF stores that are beginning to emerge.

References

1. Bereta, K., Smeros, P., Koubarakis, M.: Representing and Querying the valid time of triples for Linked Geospatial Data. In: ESWC (2013)
2. Bizer, C., Schultz, A.: The Berlin SPARQL Benchmark. In: IJSWIS. vol. 5 (2009)
3. Brodt, A., Nicklas, D., Mitschang, B.: Deep Integration of Spatial Query Processing into Native RDF Triple Stores. In: SIGSPATIAL (2010)
4. Gunther, O., Picouet, P., Saglio, J.M., Scholl, M., Oria, V.: Benchmarking Spatial Joins À La Carte. In: IJGIS. vol. 13 (2007)
5. Guo, Y., Pan, Z., Heflin, J.: LUBM: A Benchmark for OWL Knowledge Base Systems. In: Web Semantics. vol. 3 (2005)
6. Kolas, D.: A Benchmark for Spatial Semantic Web Systems. In: International Workshop on Scalable Semantic Web Knowledge Base Systems (2008)
7. Koubarakis, M., Kontoes, C., Manegold, S., Karpathiotakis, M., Kyzirakos, K., Bereta, K., Garbis, G., Nikolaou, C., Michail, D., Papoutsis, I., Herekakis, T., Ivanova, M., Zhang, Y., Pirk, H., Kersten, M., Dogani, K., Giannakopoulou, S., Smeros, P.: Real-Time Wildfire Monitoring Using Scientific Database and Linked Data Technologies. In: EDBT (2013)
8. Koubarakis, M., Karpathiotakis, M., Kyzirakos, K., Nikolaou, C., Sioutis, M.: Data Models and Query Languages for Linked Geospatial Data. In: Reasoning Web. Semantic Technologies for Advanced Query Answering. Springer (2012)
9. Kyzirakos, K., Karpathiotakis, M., Koubarakis, M.: Strabon: A Semantic Geospatial DBMS. In: ISWC (2012)
10. Morsey, M., Lehmann, J., Auer, S., Ngomo, A.C.N.: DBpedia SPARQL Benchmark - Performance Assessment with Real Queries on Real Data. In: ISWC (2011)
11. Myllymaki, J., Kaufman, J.H.: DynaMark: A Benchmark for Dynamic Spatial Indexing. In: Mobile Data Management (2003)
12. Open Geospatial Consortium: OGC GeoSPARQL - A geographic query language for RDF data. OGC Implementation Standard (2012)
13. Patel, J., Yu, J., Kabra, N., Tufte, K., Nag, B., Burger, J., Hall, N., Ramasamy, K., Lueder, R., Ellmann, C., Kupsch, J., Guo, S., Larson, J., De Witt, D., Naughton, J.: Building a Scaleable Geo-Spatial DBMS: Technology, Implementation, and Evaluation. In: ACM SIGMOD (1997)
14. Paton, N.W., Williams, M.H., Dietrich, K., Liew, O., Dinn, A., Patrick, A.: VESPA: A Benchmark for Vector Spatial Databases. In: BNCOD (2000)
15. Ray, S., Simion, B., Demke Brown, A.: Jackpine: A Benchmark to Evaluate Spatial Database Performance. In: ICDE (2011)
16. Schmidt, M., Hornung, T., Lausen, G., Pinkel, C.: SP2Bench: A SPARQL Performance Benchmark. In: ICDE (2009)
17. Stonebraker, M., Frew, J., Gardels, K., Meredith, J.: The SEQUOIA 2000 Storage Benchmark. In: ACM SIGMOD (1993)

Fig. 2: Response times - Synthetic Workload. Every panel plots the response time [sec] of Strabon, useekm, and Parliament. Panels (a)-(h) plot it against the percentage of nodes in the query region: (a) Intersects, tag 1, cold caches; (b) Intersects, tag 512, cold caches; (c) Within, tag 1, cold caches; (d) Within, tag 512, cold caches; (e) Intersects, tag 1, warm caches; (f) Intersects, tag 512, warm caches; (g) Within, tag 1, warm caches; (h) Within, tag 512, warm caches. Panels (i) Intersects, (j) Touches, and (k) Within show cold and warm caches side by side.

The Strabon evaluation by the GeoKnow project

This section discusses the results of an independent evaluation of Strabon that was carried out in the context of the EU project GeoKnow and was presented in the GeoKnow deliverable D2.1.1 Market and Research Overview, authored by S. Athanasiou, L. Bezati, G. Giannopoulos, K. Patroumpas, and D. Skoutas [ABG+13]. That deliverable was published on May 27.

The results reported in [ABG+13] regarding the performance of Strabon compared to other geospatial RDF stores differ significantly from the results that we have measured and presented in [KKKK11, KKK10], where we compare the performance of Strabon on top of two spatial relational databases (PostGIS and System X) against the implementation of [BNM10], the system Parliament, and a naive implementation. In [KKK10] we evaluated the performance of each system using a workload based on linked data that consisted of 137 million triples and a synthetic workload that consisted of 500 million triples. The size of each dataset was 30GB and 54GB respectively. In addition, the results reported in [ABG+13] differ significantly from the results that we obtained after applying the benchmark for geospatial RDF stores that we developed in [GKK13] to Strabon, useekm, Parliament, AllegroGraph, and Virtuoso. More information about the benchmark is given in Section 2.1 of this deliverable and will also be presented in [GKK13].

Because we strongly disagree with the findings of [ABG+13] concerning Strabon, in this section we repeat the benchmark proposed in [ABG+13] and examine the reported results regarding its performance. We see that just by using the latest versions of Strabon, PostgreSQL, and PostGIS that were publicly available at the time of writing of deliverable [ABG+13], and by configuring these systems appropriately, the performance of Strabon improves dramatically.
Experimental setup

For the evaluation of the geospatial RDF stores presented in [ABG+13], the authors set up five pre-built Virtual Machines (VMs), each one corresponding to a working installation of an RDF store, namely Virtuoso, Parliament, useekm, OWLIM-SE, and Strabon. The virtual machines are publicly available for download. In Tables 2.2 and 2.3 we present the versions of the geospatial RDF stores and spatially-enabled RDBMS that the authors of [ABG+13] chose to evaluate, accompanied by the release date of each version. For completeness, we also present the latest version of each system that was available at the time of writing of [ABG+13].

We cloned the VM containing the Strabon experiments that are described in detail in [ABG+13]. We installed an image of this VM on an Intel Xeon E5620 with 12MB L3 cache running at 2.4 GHz. The system has 24GB of RAM and 4 disks of striped RAID (level 5), and the operating system installed is Ubuntu. Each disk has 32MB cache and its rotational speed is 7200 rpm. We ran our queries on cold caches, as this approach was also followed in [ABG+13]. Notice that the evaluation server hosting the VMs in the case of [ABG+13] differs from ours, since it is an Intel Core CPU with 10MB cache running at 3.60GHz. However, we kept the same VM specifications for running the experiments, meaning that the VM was configured to utilize 4 (virtual) CPU cores, 8GB of main memory, and 40 GB of available storage, running Linux Debian 6.x Squeeze, 64-bit. It is important to mention that, because the evaluation servers used for hosting the VMs differ, the experimental results presented in the following differ as well.
Therefore, in order to showcase how the setup, configuration, and tuning of a system, and in our case Strabon in particular, plays a crucial role in how the system performs, we first repeat the benchmark of [ABG+13] using the pre-configured VM corresponding to Strabon as the baseline of our experiments, and then we repeat the benchmark again for different configurations. In this way, even though we cannot compare the actual performance of Strabon in absolute numbers, we can compare it relative to the performance of the baseline configuration, pinpointing the percentage improvement we gain.

Table 2.2: Geospatial RDF stores evaluated in [ABG+13]

System | Release | Latest Release | Underlying RDBMS | Latest Release of RDBMS
Virtuoso Universal Server | 7.0 (April 24, 2013)
Parliament | (Nov. 9, 2012)
useekm | a5 (April 2, 2013) | (July 2, 2013) | PostgreSQL (April 4, 2013), PostGIS (June 25, 2011) | PostgreSQL (April 4, 2013), PostGIS (March 1, 2013)
OWLIM-SE | (Oct. 25, 2012)
Strabon | (Sept. 5, 2012) | (March 26, 2013) | PostgreSQL (April 4, 2013), PostGIS (June 25, 2011) | PostgreSQL (April 4, 2013), PostGIS (March 1, 2013)

Table 2.3: Spatially-enabled RDBMS evaluated in [ABG+13]

System | Release | Latest Release (at the time of writing of [ABG+13])
Oracle Spatial | 11gR2 | 11gR2
PostgreSQL / PostGIS | (April 4, 2011) / (June 25, 2011) | (April 4, 2013) / (March 1, 2013)

Repeating the GeoKnow experiments

Table 2.4 (respectively Figure 2.1) presents the response times of Strabon following the approach described in [ABG+13]. The first column of the table contains the symbolic names that the authors of [ABG+13] gave to each query defined in their experiments. The interested reader can refer to [ABG+13] for a detailed presentation of the datasets and the queries involved. The second column of the table shows the system response times as they have been reported in [ABG+13].

We followed this procedure. First, we reproduced the experiments described in [ABG+13] by keeping the original setup of the VM, i.e., untuned PostgreSQL with PostGIS and Strabon. The results are shown in the third column of Table 2.4.
We observe that the query response time is better in our case, despite the fact that we used exactly the same configuration, but this is reasonable, as we ran the experiments on a different machine. This completely untuned configuration serves as the baseline for the experiments that we executed in this context. We also measured the time required to load the data using the Strabon BulkLoader and report it in the row BL_SI. Although the bulk loader of Strabon is mentioned in [ABG+13] and in the README file of the source code of Strabon, the authors of [ABG+13] chose to load the data using the StoreOp function, which is known to have poor performance. When using the BulkLoader we observed that the time to store the dataset decreased by 85.27%.

Second, we tuned PostgreSQL as we explained in the paper [KKK12b], in the README file of the source code of Strabon, and in the online documentation, i.e., the online user's guide and the developer's guide. The fact that Strabon performs significantly better with aggressive tuning of PostgreSQL is mentioned in [ABG+13]; however, all experiments of [ABG+13] were performed with PostgreSQL being completely untuned. As we can see in the fourth and fifth columns of Table 2.4, the response time decreases considerably. Tuning PostgreSQL allows the usage of more shared buffers and the allocation of larger amounts of memory for internal sort operations and hash tables, instead of using secondary storage. This leads to a significant performance improvement.

Third, we noticed that the version of Strabon that was used in the deliverable of GeoKnow was not the latest version available at that time. The most recent version at the time that deliverable was written was Strabon 3.2.8, which was released two months prior to the delivery of [ABG+13]. Furthermore, the versions of PostgreSQL and PostGIS used were also outdated. Although the authors of [ABG+13] use the most recent versions of PostgreSQL and PostGIS in the experimental evaluation and comparison of spatially-enabled relational databases, which is included in the same document, the considerably older versions PostgreSQL 8.4.17 and PostGIS 1.5.3, released in 2013 and 2011 respectively, were used to form the back-end of Strabon. So we ran the experiments once more using Strabon 3.2.8, PostgreSQL 9.2.4, and PostGIS 2.0.3, which were the latest versions of these systems at the time of writing of [ABG+13]. We tuned PostgreSQL to make better use of the system's resources, as described in the README file of the source code of Strabon and in the online documentation. These results are shown in the sixth column of Table 2.4. Here we can observe the following:

The query response time decreases further. This happens because in this setup Strabon is based on the newest, tuned versions of PostgreSQL and PostGIS.
In addition, the latest version of PostGIS has recently been enhanced with selectivity estimation capabilities that lead to the selection of better plans for evaluating a query. This is a crucial point that has also been discussed extensively in the Strabon paper [KKK12b].

The queries R1, G2, D1, and D2, which crashed before, execute normally in this round of experiments and return results. The reason is that the previous experiments used the wrong GeoSPARQL namespace. GeoSPARQL unfortunately changed at some point the namespaces used for the definition of spatial extension functions, which led to confusion. As a result, the authors of [ABG+13] were using the latest GeoSPARQL namespaces on an older version of Strabon that naturally supported only the old ones. Now that we have switched to the version of Strabon that was current at the time of the GeoKnow deliverable, the queries execute normally because the new GeoSPARQL namespaces are supported.

Last, we executed the experiments using the most recent version of Strabon, which is available in the Strabon Mercurial repository, keeping exactly the same back-end configuration as in the previous step. We observe in the last two columns of Table 2.4 that, in most cases, the respective query response times are slightly improved compared to the ones of the previous experiment.

Note on PostgreSQL versions: Although one may consider PostgreSQL 8.4.17 to be quite a recent release, since its release date coincides with that of PostgreSQL 9.2.4, internally the two versions differ very much, with the second version (9.2.4) being much more robust than the first one (8.4.17). They belong to different major releases, i.e., 8.4 and 9.2 respectively, and it is in major releases that PostgreSQL introduces new features, enhancements, and other significant changes. Minor releases, such as 8.4.17 and 9.2.4, contain only bug fixes of the corresponding major release. Notice that after PostgreSQL shipped the major release 8.4 (released on July 1, 2009), the major releases 9.0 (released on September 20, 2010), 9.1 (released on September 12, 2011), and 9.2 (released on September 10, 2012) followed, each one adding further functionality and improvements to the system.
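The "percentage deviation from baseline" that is reported for each configuration can be computed as in the following minimal sketch. The function name, the sign convention (a negative value means the configuration is faster than the baseline), and the sample timings are assumptions made for illustration; they are not values taken from Table 2.4.

```python
def percentage_deviation(baseline_ms: float, measured_ms: float) -> float:
    """Relative change of a measurement with respect to the baseline, in percent.

    Assumed convention: negative means faster than the baseline.
    """
    if baseline_ms <= 0:
        raise ValueError("baseline must be positive")
    return (measured_ms - baseline_ms) / baseline_ms * 100.0


# Hypothetical timings, chosen only to illustrate the calculation.
baseline_ms = 10_000.0
tuned_ms = 2_500.0
print(f"{percentage_deviation(baseline_ms, tuned_ms):+.2f}%")
```

Reporting a relative figure of this kind is what allows runs on different host machines to be compared against their own baseline rather than against absolute times.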

Table 2.4: Execution of experiments using datasets and queries defined in [ABG+13] w.r.t. different configurations

Columns: Measurement | GeoKnow results (ms) | Reproduction of GeoKnow exp. (baseline) | Strabon v3.2.3, PostGIS 1.5.3, PG 8.4 tuned (ms) | Percentage deviation from baseline | Strabon v3.2.8, PostGIS 2.0.3, PG tuned (ms) | Percentage deviation from baseline | Strabon tip, PostGIS 2.0.3, PG tuned (ms) | Percentage deviation from baseline

records 25,796,782
BL+SI 156,585, ,161, crashes out of memory 21,985,
BL_SI (Loader) 1,497, ,442, % 1,281, % 1,281, %
L1 305, % % %
L2 225, , % %
R1 Crashed after submission 9, , % 1, % 1, %
R2 1,847, , , % 42, % 60, %
SJ1 incomplete after 4 hours 244, , % >4 hours - 132, %
SJ1a N/A N/A N/A N/A N/A N/A N/A N/A
SJ2 incomplete after >6 hours crashes crashes - crashes - crashes -
SJ2a N/A N/A N/A N/A N/A N/A N/A N/A
knn1 crashed after submission 26, , % 11, % 11, %
knn1a 171, , , % 11, % 10, %
knn2 1,455, , , % 15, % 15, %
knn2a N/A N/A N/A N/A N/A N/A N/A N/A
G1 177, , , % 2, % 2, %
G2 crashed after crashes , ,
G3 697, % % %
D1 crashed after submission 2, , % 1, % %
D1a N/A N/A N/A N/A N/A N/A N/A N/A
D2 crashed after submission 2, , , % 1, %
D2a N/A N/A N/A N/A N/A N/A N/A N/A
A1 517, , , % 50, % 40, %
N1 976, % % %

Summary

In this section we ran a number of experiments to evaluate Strabon, following the approach reported in [ABG+13], and we deviated from that approach in the following directions:

- We used the latest versions of PostgreSQL (PostgreSQL 9.2.4) and PostGIS (PostGIS 2.0.3).
- We used the latest version of Strabon, as well as the version of Strabon that was available at the time deliverable [ABG+13] was in preparation. This version supports, among other things, the most recent namespaces of GeoSPARQL that are used in the queries of the experiments.
- We tuned PostgreSQL in order to make better use of the system resources, similarly to what other systems like Virtuoso and Oracle propose, by following the steps that we have documented in the README file of the source code of Strabon and in the online documentation.

We showed that the above configurations improve the performance of Strabon dramatically, which is consistent with what we have observed in the past in our publications [KKK12b, GKK13, KKKK11]. However, we still cannot compare the above measurements with the ones of the other systems reported in [ABG+13], for two reasons. Firstly, we would have to run the respective experiments on the same setup (as we did for Strabon), and this setup is not available to us. Secondly, deliverable [ABG+13] does not evaluate some of the queries across all systems considered, even though they are fully supported. For example, while in [ABG+13] the query SJ2 is evaluated across all systems, its variant SJ2a, which contains an additional limit modifier, is evaluated against Parliament, OWLIM-SE, Oracle Spatial, and PostGIS, but not against Strabon. The same applies to the queries knn2a, D1a, and D2a, which are fully supported by Strabon.

Since, in our opinion and as we have shown in the paper [GKK13], Strabon is both functionally and performance-wise the best available geospatial RDF store at the moment, we close with a recommendation to other researchers who would like to compare their work to Strabon. We strongly recommend that any functional comparison of Strabon with other spatial RDF stores take into account the latest version of Strabon, and that experiments measuring the response time of Strabon take into account all the issues that we have discussed in this section and documented in the README file of the source code of Strabon and in the online documentation.

Figure 2.1: Execution times of different version/configurations of Strabon per query family. Panels: (a) BL query family (BL+SI, BL_SI); (b) L query family (L1, L2); (c) R query family (R1, R2); (d) SJ query family (SJ1, SJ1a, SJ2, SJ2a); (e) knn query family (knn1, knn1a, knn2, knn2a); (f) G query family (G1, G2, G3); (g) D query family (D1, D1a, D2, D2a); (h) A query family (A1); (i) N query family (N1). Each panel plots execution time (in seconds or ms, depending on the panel) per query code name for the configurations GeoKnow(baseline), Strabon_v3.2.3_PSQL8.4_PG1.5_T, Strabon_v3.2.8_PSQL9.2_PG2.0_T, and Strabon_tip_PSQL9.2_PG2.0_T. Bars that could not be measured are marked as crash, out of memory, timeout (4h), or N/A.

Evaluation of the temporal features of Strabon

Our work in this area is described in the following paper, "Representing and Querying the valid time of triples for Linked Geospatial Data", which was presented at ESWC 2013.

Representation and Querying of Valid Time of Triples in Linked Geospatial Data

Konstantina Bereta, Panayiotis Smeros, and Manolis Koubarakis
National and Kapodistrian University of Athens, Greece

Abstract. We introduce the temporal component of the strdf data model and the stsparql query language, which have been recently proposed for the representation and querying of linked geospatial data that changes over time. With this temporal component in place, stsparql becomes a very expressive query language for linked geospatial data, going beyond the recent OGC standard GeoSPARQL, which has no support for valid time of triples. We present the implementation of the stsparql temporal component in the system Strabon, and study its performance experimentally. Strabon is shown to outperform all the systems it has been compared with.

1 Introduction

The introduction of time in data models and query languages has been the subject of extensive research in the field of relational databases [6, 20]. Three distinct kinds of time were introduced and studied: user-defined time, which has no special semantics (e.g., January 1st, 1963 when John has his birthday); valid time, which is the time an event takes place or a fact is true in the application domain (e.g., the time when John is a professor); and transaction time, which is the time when a fact is current in the database (e.g., the system time that gives the exact period when the tuple representing that John is a professor from 2000 to 2012 is current in the database). In these research efforts, many temporal extensions to SQL92 were proposed, leading to the query language TSQL2, the most influential query language for temporal relational databases proposed at that time [20]. However, although the research output of the area of temporal relational databases has been impressive, TSQL2 did not make it into the SQL standard and the commercial adoption of temporal database research was very slow.
It is only recently that commercial relational database systems started offering SQL extensions for temporal data, such as IBM DB2, Oracle Workspace manager, and Teradata [2]. Also, in the latest standard of SQL (SQL:2011), an important new feature is the support for valid time (called application time) and transaction time. Each SQL:2011 table is allowed to have at most two periods (one for application time and one for transaction time). A period for a table T is defined by associating a user-defined name, e.g., EMPLOYMENT_TIME (in the case of application time), or the built-in name SYSTEM_TIME (in the case of transaction time) with two columns of T that are the start and end times of the period (a closed-open convention for periods is followed). These columns must have the same datatype, which must be either DATE or a timestamp type (i.e., no new period datatype is introduced by the standard). Finally, the various SQL statements are enhanced in minimal ways to capture the new temporal features.

(This work was supported in part by the European Commission project TELEIOS.)

Compared to the relational database case, little research has been done to extend the RDF data model and the query language SPARQL with temporal features. Gutierrez et al. [8, 9] were the first to propose a formal extension of the RDF data model with valid time support. They also introduce the concept of anonymous timestamps in general temporal RDF graphs, i.e., graphs containing quads of the form (s, p, o)[t] where t is a timestamp or an anonymous timestamp x stating that the triple (s, p, o) is valid in some unknown time point x. The work described in [11] subsequently extends the concept of general temporal RDF graphs of [9] to express temporal constraints involving anonymous timestamps. In the same direction, Lopes et al. integrated valid time support in the general framework that they have proposed in [15] for annotating RDF triples. Similarly, Tappolet and Bernstein [22] have proposed the language τ-sparql for querying the valid time of triples, showed how to transform τ-sparql into standard SPARQL (using named graphs), and briefly discussed an index that can be used for query evaluation. Finally, Perry [19] proposed an extension of SPARQL, called SPARQL-ST, for representing and querying spatiotemporal data. The main idea of [19] is to incorporate geospatial information into the temporal RDF graph model of [9].
The query language SPARQL-ST adds two new types of variables, namely spatial and temporal ones, to the standard SPARQL variables. Temporal variables (denoted by a # prefix) are mapped to time intervals and can appear in the fourth position of a quad as described in [9]. In SPARQL-ST two special filters are introduced: SPATIAL FILTER and TEMPORAL FILTER. They are used to filter the query results with spatial and temporal constraints (OGC Simple Feature Access topological relations and distance for the spatial part, and Allen's interval relations [3] for the temporal part).

Following the ideas of Perry [19], our group proposed a formal extension of RDF, called strdf, and the corresponding query language stsparql for the representation and querying of temporal and spatial data using linear constraints [13]. strdf and stsparql were later redefined in [14] so that geometries are represented using the Open Geospatial Consortium standards Well-Known-Text (WKT) and Geography Markup Language (GML). Both papers [13] and [14] mention very briefly the temporal dimension of strdf and do not go into details. Similarly, the version of the system Strabon presented in [14], which implements strdf and stsparql, does not implement the temporal dimension of this data model and query language. In this paper we remedy this situation by introducing all the details of the temporal dimension of strdf and stsparql and implementing it in Strabon.
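Allen's interval relations, mentioned above as the temporal part of the SPARQL-ST filters, can be illustrated with a small classifier. This is a minimal sketch and not code from any of the systems discussed: the function name, the relation spellings, and the representation of an interval as a (start, end) pair of integers with start < end are assumptions made for the example.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end) with start < end; hypothetical representation


def allen_relation(a: Interval, b: Interval) -> str:
    """Return which of Allen's 13 relations holds between intervals a and b."""
    a_start, a_end = a
    b_start, b_end = b
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must satisfy start < end")
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "metby"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "startedby"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finishedby"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Only proper overlaps remain at this point.
    return "overlaps" if a_start < b_start else "overlappedby"


print(allen_relation((2000, 2006), (2004, 2010)))
```

Exactly one of the thirteen relations holds for any pair of intervals, which is why the checks can be written as a chain of mutually exclusive cases.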

The original contributions of this paper are the following.

- We present in detail, for the first time, the valid time dimension of the data model strdf and the query language stsparql. Although the valid time dimension of strdf and stsparql is in the spirit of [19], it is introduced in a language with a much more mature geospatial component based on OGC standards [14]. In addition, the valid time component of stsparql offers a richer set of functions for querying valid times than the ones in [19]. With the temporal dimension presented in this paper, stsparql also becomes more expressive than the recent OGC standard GeoSPARQL [1]. While stsparql can represent and query geospatial data that changes over time, GeoSPARQL only supports static geospatial data.
- We discuss our implementation of the valid time component of strdf and stsparql in Strabon. We evaluate the performance of our implementation on two large real-world datasets and compare it to three other implementations: (i) a naive implementation based on the native store of Sesame, which we extended with valid time support; (ii) AllegroGraph, which, although it does not explicitly support valid time of triples, allows the definition of time instants and intervals and their location on a time line, together with a rich set of functions for writing user queries; and (iii) the Prolog-based implementation of the query language AnQL, which is the only available implementation with explicit support for valid time of triples. Our results show that Strabon outperforms all other implementations.

This paper is structured as follows. In Section 2 we introduce the temporal dimension of the data model strdf and in Section 3 we present the temporal features of the query language stsparql. In Section 4 we describe how we extended the system Strabon with valid time support. In Section 5 we evaluate our implementation experimentally and compare it with other related implementations.
In Section 6 we present related work in this field. Section 7 concludes this paper.

2 Valid Time Representation in the Data Model strdf

In this section we describe the valid time dimension of the data model strdf presented in [14]. The time line assumed is the (discrete) value space of the datatype xsd:datetime of XML-Schema. Two kinds of time primitives are supported: time instants and time periods. A time instant is an element of the time line. A time period (or simply period) is an expression of the form [B,E), (B,E], (B,E), or [B,E] where B and E are time instants called the beginning and the ending of the period respectively. Since the time line is discrete, we often assume only periods of the form [B,E) with no loss of generality. Syntactically, time periods are represented by literals of the new datatype strdf:period that we introduce in strdf. The value space of strdf:period is the set of all time periods covered by the above definition. The lexical space of strdf:period is trivially defined from the lexical space of xsd:datetime and the closed/open period notation introduced above. Time instants can also be represented as closed periods with the same beginning and ending time.

Values of the datatype strdf:period can be used as objects of a triple to represent user-defined time. In addition, they can be used to represent valid times of temporal triples, which are defined as follows. A temporal triple (quad) is an expression of the form s p o t. where s p o. is an RDF triple and t is a time instant or a time period called the valid time of the triple. An strdf graph is a set of triples and temporal triples. In other words, some triples in an strdf graph might not be associated with a valid time.

We also assume the existence of the temporal constants NOW and UC, inspired by the literature of temporal databases [5]. NOW represents the current time and can appear in the beginning or the ending point of a period. It will be used in stsparql queries to be introduced in Section 3. UC means Until Changed and is used for introducing valid times of a triple that persist until they are explicitly terminated by an update. For example, when John becomes an associate professor on 1/1/2013, this is assumed to hold in the future until an update terminates this fact (e.g., when John is promoted to professor).

Example 1. The following strdf graph consists of temporal triples that represent the land cover of an area in Spain for the time periods [2000, 2006) and [2006, UC), and triples which encode other information about this area, such as its code and the WKT serialization of its geometry extent. In this and the following examples, namespaces are omitted for brevity. The prefix strdf stands for the namespace where one can find all the relevant datatype definitions underlying the model strdf.

corine:area_4 rdf:type corine:area.
corine:area_4 corine:hasid "EU ".
corine:area_4 corine:haslandcover corine:coniferousforest "[ T00:00:00, T00:00:00)"^^strdf:period.
corine:area_4 corine:haslandcover corine:naturalgrassland "[ T00:00:00,UC)"^^strdf:period.
corine:area_4 corine:hasgeometry "POLYGON(( ,...))"^^strdf:WKT.

The strdf graph provided above is written using the N-Quads format, which has been proposed for the general case of adding context to a triple. The graph has been extracted from a publicly available dataset provided by the European Environmental Agency (EEA) that contains the changes in the CORINE Land Cover dataset for the time period [2000, UC) for various European areas. According to this dataset, the area corine:area_4 was a coniferous forest area until 2006, when the newer version of CORINE showed it to be natural grassland. Until the CORINE Land Cover dataset is updated, UC is used to denote the persistence of the land cover values of 2006 into the future. The last triple of the strdf graph gives the WKT serialization of the geometry of the area (not all vertices of the polygon are shown due to space considerations). This dataset will be used in our examples but also in the experimental evaluation of Section 5.
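The period notation and the temporal triples of Example 1 can be mirrored in a few lines of Python. This is a minimal sketch under stated assumptions and not Strabon's implementation: the class and function names are hypothetical, the concrete dates are placeholders chosen to match the periods [2000, 2006) and [2006, UC) named in the text, and treating UC as an open-ended future is one possible reading of "Until Changed".

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Union

UC = "UC"  # "Until Changed" marker; modelled here as an open-ended future (assumption)


@dataclass(frozen=True)
class Period:
    start: datetime
    end: Union[datetime, str]  # a datetime, or the marker UC
    start_closed: bool = True
    end_closed: bool = False


@dataclass(frozen=True)
class Statement:
    """A triple, optionally annotated with a valid time (a temporal triple)."""
    s: str
    p: str
    o: str
    valid_time: Optional[Period] = None


def parse_period(lexical: str) -> Period:
    """Parse a literal such as "[2000-01-01T00:00:00, 2006-01-01T00:00:00)"."""
    text = lexical.strip()
    if len(text) < 2 or text[0] not in "[(" or text[-1] not in "])":
        raise ValueError(f"not a period literal: {lexical!r}")
    begin_text, end_text = (part.strip() for part in text[1:-1].split(","))
    start = datetime.fromisoformat(begin_text)
    end = UC if end_text == UC else datetime.fromisoformat(end_text)
    return Period(start, end, text[0] == "[", text[-1] == "]")


def contains(period: Period, instant: datetime) -> bool:
    """True if the instant lies inside the period, honouring open/closed ends."""
    after_start = instant >= period.start if period.start_closed else instant > period.start
    if isinstance(period.end, str):  # UC: no known ending yet
        return after_start
    before_end = instant <= period.end if period.end_closed else instant < period.end
    return after_start and before_end


# Placeholder dates; the literal values of Example 1 are not reproduced here.
forest = Statement("corine:area_4", "corine:haslandcover", "corine:coniferousforest",
                   parse_period("[2000-01-01T00:00:00, 2006-01-01T00:00:00)"))
grass = Statement("corine:area_4", "corine:haslandcover", "corine:naturalgrassland",
                  parse_period("[2006-01-01T00:00:00,UC)"))

when = datetime(2010, 6, 1)
current = [st.o for st in (forest, grass) if contains(st.valid_time, when)]
print(current)
```

Keeping the valid time as an optional fourth component reflects the definition above: an strdf graph may mix plain triples with temporal triples.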

3 Querying Valid Times Using stsparql

The query language stsparql is an extension of SPARQL 1.1. Its geospatial features have been presented in [12] and [14]. In this section we introduce for the first time the valid time dimension of stsparql. The new features of the language are:

Temporal Triple Patterns. Temporal triple patterns are introduced as the most basic way of querying temporal triples. A temporal triple pattern is an expression of the form s p o t., where s p o. is a triple pattern and t is a time period or a variable.

Temporal Extension Functions. Temporal extension functions are defined in order to express temporal relations between expressions that evaluate to values of the datatypes xsd:datetime and strdf:period. The first set of such temporal functions are 13 Boolean functions that correspond to the 13 binary relations of Allen's Interval Algebra. stsparql also offers nine functions that are syntactic sugar, i.e., they encode frequently-used disjunctions of these relations. There are also three functions that allow relating an instant with a period:

xsd:boolean strdf:during(xsd:datetime i2, strdf:period p1): returns true if instant i2 is during the period p1.
xsd:boolean strdf:before(xsd:datetime i2, strdf:period p1): returns true if instant i2 is before the period p1.
xsd:boolean strdf:after(xsd:datetime i2, strdf:period p1): returns true if instant i2 is after the period p1.

The above point-to-period relations appear in [16]. The work described in [16] also defines two other functions allowing an instant to be equal to the starting or ending point of a period. In our case these can be expressed using the SPARQL 1.1 operator = (for values of xsd:datetime) and the functions period_start and period_end defined below. Furthermore, stsparql offers a set of functions that construct new (closed-open) periods from existing ones.
These functions are the following:

strdf:period strdf:period_intersect(period p1, period p2): This function is defined if p1 intersects with p2 and it returns the intersection of period p1 with period p2.
strdf:period strdf:period_union(period p1, period p2): This function is defined if period p1 intersects p2 and it returns a period that starts with p1 and finishes with p2.
strdf:period strdf:minus(period p1, period p2): This function is defined if periods p1 and p2 are related by one of Allen's relations overlaps, overlappedby, starts, startedby, finishes, finishedby and it returns a period that is constructed from period p1 with its common part with p2 removed.
strdf:period strdf:period(xsd:dateTime i1, xsd:dateTime i2): This function constructs a (closed-open) period having instant i1 as beginning and instant i2 as ending time.
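The period constructors listed above can be illustrated with a minimal Python sketch over closed-open periods. This is not the Strabon implementation: the class `Period` and the helpers `year`, `intersects`, `period_intersect`, `period_union` and `period_minus` are hypothetical names chosen for illustration. Two assumptions are made: the union takes the earliest start and the latest end (which matches "starts with p1 and finishes with p2" when p1 begins first), and the functions return `None` where the text says a function is not defined or where nothing of p1 would remain.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Period:
    """A closed-open period [start, end)."""
    start: datetime
    end: datetime

    def __post_init__(self) -> None:
        if self.start >= self.end:
            raise ValueError("a period needs start < end")


def year(n: int) -> datetime:
    """Convenience helper: the first instant of year n."""
    return datetime(n, 1, 1)


def intersects(p1: Period, p2: Period) -> bool:
    # Closed-open periods that merely meet (p1.end == p2.start) do not intersect.
    return p1.start < p2.end and p2.start < p1.end


def period_intersect(p1: Period, p2: Period) -> Optional[Period]:
    if not intersects(p1, p2):
        return None  # assumption: "not defined" is modelled as None
    return Period(max(p1.start, p2.start), min(p1.end, p2.end))


def period_union(p1: Period, p2: Period) -> Optional[Period]:
    if not intersects(p1, p2):
        return None
    # Assumption: earliest start, latest end.
    return Period(min(p1.start, p2.start), max(p1.end, p2.end))


def period_minus(p1: Period, p2: Period) -> Optional[Period]:
    """p1 with its common part with p2 removed, when one period remains."""
    if not intersects(p1, p2):
        return None
    left = p1.start < p2.start   # p1 has a part before p2
    right = p2.end < p1.end      # p1 has a part after p2
    if left and not right:       # overlaps, finishedby
        return Period(p1.start, p2.start)
    if right and not left:       # overlappedby, startedby
        return Period(p2.end, p1.end)
    return None                  # nothing remains, or two pieces would remain


if __name__ == "__main__":
    a = Period(year(2000), year(2006))
    b = Period(year(2004), year(2010))
    print(period_intersect(a, b))
    print(period_union(a, b))
    print(period_minus(a, b))
```

Modelling undefined results as `None` keeps the sketch total; an actual engine might instead raise an error or leave the variable unbound.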

There are also the functions strdf:period_start and strdf:period_end that take as input a period p and return an output of type xsd:dateTime which is the beginning and ending time of the period p respectively.

Finally, stSPARQL defines the following functions that compute temporal aggregates:

strdf:period strdf:intersectall(set of period p): Returns a period that is the intersection of the elements of the input set that have a common intersection.
strdf:period strdf:maximalperiod(set of period p): Constructs a period that begins with the smallest beginning point and ends with the maximum endpoint of the set of periods given as input.

The query language stSPARQL, being an extension of SPARQL 1.1, allows the temporal extension functions defined above in the SELECT, FILTER and HAVING clauses of a query. A complete reference of the temporal extension functions of stSPARQL is available on the Web.

Temporal Constants. The temporal constants NOW and UC can be used in queries to retrieve triples whose valid time has not ended at the time of posing the query or we do not know when it ends, respectively.

The new expressive power that the valid time dimension of stSPARQL adds to the version of the language presented in [14], where only the geospatial features were presented, is as follows. First, a rich set of temporal functions is offered to express queries that refer to temporal characteristics of some non-spatial information in a dataset (e.g., see Examples 2, 3 and 6 below). In terms of expressive power, the temporal functions of stSPARQL offer the expressivity of the qualitative relations involving points and intervals studied by Meiri [16]. However, we do not have support (yet) for quantitative temporal constraints in queries (e.g., T1 − T2 ≤ 5). Secondly, these new constructs can be used together with the geospatial features of stSPARQL (geometries, spatial functions, etc.) to express queries on geometries that change over time (see Examples 4 and 5 below).
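The two temporal aggregates described earlier in this section can be sketched in a few lines of Python. The function names `intersect_all` and `maximal_period` are hypothetical stand-ins, periods are modelled as plain `(start, end)` tuples with closed-open semantics, and returning `None` when the input has no common intersection is an assumption of this sketch rather than documented behaviour.

```python
from typing import Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")            # any totally ordered time value (datetime, int, ...)
Period = Tuple[T, T]        # closed-open period [start, end)


def intersect_all(periods: Iterable[Period]) -> Optional[Period]:
    """Common intersection of all input periods, or None if there is none."""
    items = list(periods)
    if not items:
        return None
    start = max(p[0] for p in items)   # latest beginning
    end = min(p[1] for p in items)     # earliest ending
    return (start, end) if start < end else None


def maximal_period(periods: Iterable[Period]) -> Optional[Period]:
    """Period from the smallest beginning point to the maximum endpoint."""
    items = list(periods)
    if not items:
        return None
    return (min(p[0] for p in items), max(p[1] for p in items))


if __name__ == "__main__":
    sample = [(2000, 2006), (2003, 2010), (2005, 2008)]
    print(intersect_all(sample))
    print(maximal_period(sample))
```

Note that `maximal_period` does not require the inputs to intersect, so the result may cover gaps between them.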
The temporal and spatial functions offered by stSPARQL are orthogonal and can be combined with the functions offered by SPARQL 1.1 in arbitrary ways to query geospatial data that changes over time (e.g., the land cover of an area) but also moving objects [10] (we have chosen not to cover this interesting application in this paper).

In the rest of this section, we give some representative examples that demonstrate the expressive power of stSPARQL.

Example 2. Temporal selection and temporal constants. Return the current land cover of each area mentioned in the dataset.

SELECT ?clcArea ?clc
WHERE { ?clcArea rdf:type corine:area ;
                 corine:haslandcover ?clc ?t .
        FILTER(strdf:during(NOW, ?t)) }

This query is a temporal selection query that uses an extended Turtle syntax that we have devised to encode temporal triple patterns. In this extended syntax, the

fourth element is optional and it represents the valid time of the triple pattern. The temporal constant NOW is also used.

Example 3. Temporal selection and temporal join. Give all the areas that were forests in 1990 and were burned some time after that time.

SELECT ?clcArea
WHERE { ?clcArea rdf:type corine:area ;
                 corine:haslandcover corine:coniferousforest ?t1 ;
                 corine:haslandcover corine:burnedarea ?t2 .
        FILTER(strdf:during(?t1, " T00:00:00"^^xsd:dateTime) && strdf:after(?t2, ?t1)) }

This query shows the use of variables and temporal functions to join information from different triples.

Example 4. Temporal join and spatial metric function. Compute the area occupied by coniferous forests that were burnt at a later time.

SELECT ?clcArea (SUM(strdf:area(?geo)) AS ?totalArea)
WHERE { ?clcArea rdf:type corine:area ;
                 corine:haslandcover corine:coniferousforest ?t1 ;
                 corine:haslandcover corine:burntarea ?t2 ;
                 corine:hasgeometry ?geo .
        FILTER(strdf:before(?t1, ?t2)) }
GROUP BY ?clcArea

In this query, a temporal join is performed by using the temporal extension function strdf:before to ensure that areas included in the result set were covered by coniferous forests before they were burnt. The query also uses, in its SELECT clause, the spatial metric function strdf:area, which computes the area of a geometry. The aggregate function SUM of SPARQL 1.1 is used to compute the total area occupied by burnt coniferous forests.

Example 5. Temporal join and spatial selection. Return the evolution of the land cover use of all areas contained in a given polygon.

SELECT ?clc1 ?t1 ?clc2 ?t2
WHERE { ?clcArea rdf:type corine:area ;
                 corine:haslandcover ?clc1 ?t1 ;
                 corine:haslandcover ?clc2 ?t2 ;
                 clc:hasgeometry ?geo .
        FILTER(strdf:contains(?geo, "POLYGON(( ,...))"^^strdf:WKT))
        FILTER(strdf:before(?t1, ?t2)) }

The query described above performs a temporal join and a spatial selection. The spatial selection checks whether the geometry of an area is contained in the given polygon.
The temporal join is used to capture the temporal evolution of the land cover in pairs of periods that precede one another.

Example 6. Update statement with temporal joins and period constructor.

UPDATE { ?area corine:haslandcover ?clcArea ?coalesced }
WHERE { SELECT (?clcArea AS ?area) ?clcArea (strdf:period_union(?t1, ?t2) AS ?coalesced)
        WHERE { ?clcArea rdf:type corine:area ;
                         corine:haslandcover ?clcArea ?t1 ;
                         corine:haslandcover ?clcArea ?t2 .
                FILTER(strdf:meets(?t1, ?t2) || strdf:overlaps(?t1, ?t2)) } }

In this update, we perform an operation called coalescing in the literature of temporal relational databases: two temporal triples with exactly the same subject, predicate and object, and periods that overlap or meet each other, can be joined into a single triple whose valid time is the union of the periods of the original triples [4].
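To make the coalescing operation concrete, the following Python sketch merges temporal triples outside of any RDF store. It is an illustration under stated assumptions, not the way Strabon evaluates the update: the function name `coalesce` and the flat tuple layout `(s, p, o, start, end)` are hypothetical, periods are closed-open, and the sketch merges whole chains of meeting or overlapping periods in one pass, whereas the update above joins one pair of triples at a time.

```python
from typing import Dict, Hashable, List, Tuple

# A temporal triple as (subject, predicate, object, start, end), period [start, end).
TemporalTriple = Tuple[Hashable, Hashable, Hashable, int, int]


def coalesce(triples: List[TemporalTriple]) -> List[TemporalTriple]:
    """Merge triples with identical s, p, o whose periods overlap or meet."""
    by_spo: Dict[tuple, List[Tuple[int, int]]] = {}
    for s, p, o, start, end in triples:
        by_spo.setdefault((s, p, o), []).append((start, end))

    result: List[TemporalTriple] = []
    for spo, periods in by_spo.items():
        periods.sort()                       # order by start, then by end
        cur_start, cur_end = periods[0]
        for start, end in periods[1:]:
            if start <= cur_end:             # overlaps (<) or meets (==)
                cur_end = max(cur_end, end)
            else:                            # gap: emit and start a new period
                result.append(spo + (cur_start, cur_end))
                cur_start, cur_end = start, end
        result.append(spo + (cur_start, cur_end))
    return result


if __name__ == "__main__":
    data = [
        ("area_4", "haslandcover", "forest", 2000, 2003),
        ("area_4", "haslandcover", "forest", 2003, 2006),
        ("area_4", "haslandcover", "grassland", 2006, 2010),
    ]
    for triple in coalesce(data):
        print(triple)
```

Sorting the periods of each (s, p, o) group first means a single linear scan is enough to find every maximal merged period.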

Fig. 1. Architecture of the system Strabon enhanced with valid time support

4 Implementation of Valid Time Support in Strabon

Figure 1 shows the architecture of the system Strabon presented in [14], as it has been extended for valid time support. We have added new components and extended existing ones as we explain below. As described in [14], Strabon has been implemented by extending Sesame and using an RDBMS as a backend. Currently, PostgreSQL and MonetDB can be used as backends. To support the geospatial functionality of stSPARQL efficiently, as we have shown in [14], Strabon uses PostGIS, an extension of PostgreSQL for storing and querying spatial objects and evaluating spatial operations. To offer support for the valid time dimension of stSPARQL discussed in this paper, the following new components have been added to Strabon.

Named Graph Translator. This component is added to the storage manager and translates the temporal triples of stRDF to standard RDF triples following the named graphs approach of [22] as we discuss below.

stSPARQL to SPARQL 1.1 Translator. This component is added to the query engine so that temporal triple patterns are translated to triple patterns as we discuss below.

PostgreSQL Temporal. This is a temporal extension of PostgreSQL which defines a PERIOD datatype and implements a set of temporal functions. This datatype and its associated functions come in very handy for the implementation of the valid time support in Strabon as we will see below. PostgreSQL Temporal also allows the use of a GiST index on PERIOD columns. Using this add-on, PostgreSQL becomes temporally enabled as it adds support for storing and querying PERIOD objects and for evaluating temporal functions.

Storing Temporal Triples.
When a user wants to store stRDF data in Strabon, she makes them available in the form of an N-Quads document. This document is decomposed into temporal triples and each temporal triple is processed separately by the storage manager as follows. First, the temporal triple is translated into the named graph representation. To achieve this, a URI is created and it is assigned to a named graph that corresponds to the validity period of the

triple. To ensure that every distinct valid time of a temporal triple corresponds to exactly one named graph, the URI of the graph is constructed using the literal representation of the valid time annotation. Then, the triple is stored in the named graph identified by this URI, and the URI of the named graph is associated with its corresponding valid time by storing the following triple in the default graph: (g, strdf:hasvalidtime, t), where g is the URI of the graph and t is the corresponding valid time. For example, the temporal triple

corine:area_4 corine:haslandcover corine:naturalgrassland "[ T00:00:00, T00:00:00)"^^strdf:period

will be translated into the following standard RDF triples:

corine:area_4 corine:haslandcover corine:naturalgrassland
corine: t00:00:00_ t00:00:00 strdf:hasvalidtime "[ T00:00:00, T00:00:00)"^^strdf:period

The first triple will be stored in the named graph with URI corine: t00:00:00_ t00:00:00 and the second in the default graph. If later on another temporal triple with the same valid time is stored, its corresponding triple will end up in the same named graph.

For the temporal literals found during data loading, we deviate from the default behaviour of Sesame by storing the instances of the strdf:period datatype in a table with schema period_values(id int, value period). The attribute id is used to assign a unique identifier to each period and associate it with its RDF representation as a typed literal. It corresponds to the respective id value that is assigned to each URI after the dictionary encoding is performed. The attribute value is a temporal column of the PERIOD datatype defined in PostgreSQL Temporal. In addition, we construct a GiST index on the value column.

Querying Temporal Triples. Let us now explain how the query engine of Strabon presented in [14] has been extended to evaluate temporal triple patterns. When a temporal triple pattern is encountered, the query engine of Strabon executes the following steps.
First, the stSPARQL to SPARQL 1.1 Translator converts each temporal triple pattern of the form s p o t into the graph pattern

GRAPH ?g { s p o . } ?g strdf:hasvalidtime t .

where s, p, o are RDF terms or variables and t is either a variable or an instance of the datatypes strdf:period or xsd:dateTime. Then the query is parsed and optimized by the respective components of Strabon and passed to the evaluator, which has been modified as follows: if a temporal extension function is present, the evaluator incorporates the table period_values into the query tree and declares that the arguments of the temporal function will be retrieved from the period_values table. In this way, all temporal extension functions are evaluated at the database level using PostgreSQL Temporal. Finally, the RDBMS evaluation module has been extended so that the execution plan produced by the logical level of Strabon is translated into suitable SQL statements. The temporal extension functions are respectively mapped into SQL statements using the functions and operators provided by PostgreSQL Temporal.
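The named graph approach described in this section, on both the storage side and the query side, can be summarised with a small Python sketch. It is only an illustration: `graph_uri`, `store_temporal_triple` and `translate_pattern` are hypothetical helpers, the `ex:graph/` prefix and the percent-encoding of the valid time literal are assumptions about how a graph URI could be derived, and the real translators operate on parsed RDF terms and query trees rather than on strings.

```python
from typing import List, Optional, Tuple
from urllib.parse import quote

HAS_VALID_TIME = "strdf:hasvalidtime"

# A quad as (subject, predicate, object, graph); graph None is the default graph.
Quad = Tuple[str, str, str, Optional[str]]


def graph_uri(valid_time: str, prefix: str = "ex:graph/") -> str:
    """One graph URI per distinct valid time literal (hypothetical scheme)."""
    return prefix + quote(valid_time, safe="")


def store_temporal_triple(s: str, p: str, o: str, valid_time: str) -> List[Quad]:
    """Translate a temporal triple into its named graph representation."""
    g = graph_uri(valid_time)
    return [
        (s, p, o, g),                           # the triple, inside graph g
        (g, HAS_VALID_TIME, valid_time, None),  # valid time of g, default graph
    ]


def translate_pattern(s: str, p: str, o: str, t: str, graph_var: str = "?g") -> str:
    """Rewrite the temporal triple pattern 's p o t' as a SPARQL 1.1 pattern."""
    return (f"GRAPH {graph_var} {{ {s} {p} {o} . }} "
            f"{graph_var} {HAS_VALID_TIME} {t} .")


if __name__ == "__main__":
    for quad in store_temporal_triple("corine:area_4", "corine:haslandcover",
                                      "corine:naturalgrassland", "[2006, UC)"):
        print(quad)
    print(translate_pattern("?x", "gov:hasrole", "?term", "?t"))
```

A query with several temporal triple patterns would need a fresh graph variable for each pattern, which is why `graph_var` is a parameter in this sketch.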

5 Evaluation

For the experimental evaluation of our system, we used two different datasets. The first dataset is the GovTrack dataset, which consists of RDF data about the US Congress. This dataset was created by Civic Impulse, LLC and contains information about US Congress members, bills and voting records. The second dataset is the CORINE Land Cover changes dataset that represents changes for the period [2000, UC), which we have already introduced in Section 2.

The GovTrack dataset contains temporal information in the form of instants and periods, but in standard RDF format using reification. So, in the preprocessing step we transformed the dataset into N-Quads format. For example, the 5 triples

congress_people:a politico:hasrole _:node17d3oolkdx1.
_:node17d3oolkdx1 time:from _:node17d3oolkdx2.
_:node17d3oolkdx1 time:to _:node17d3oolkdx3.
_:node17d3oolkdx2 time:at " "^^xs:date.
_:node17d3oolkdx3 time:at " "^^xs:date.

were transformed into a single quad:

congress_people:a politico:hasrole _:node17d3oolkdx1 "[ T00:00:00, T00:00:00]"^^strdf:period.

The transformed dataset has a total of 7,900,905 triples, 42,049 of which have periods as valid time and 294,636 have instants.

The CORINE Land Cover changes dataset for the time period [2000, UC) is publicly available in the form of shapefiles and it contains the areas that have changed their land cover between the years 2000 and 2006. Using this dataset, we created a new dataset in N-Quads form which has information about geographic regions such as unique identifiers, geometries and the periods during which regions have a given land cover. The dataset contains 717,934 temporal triples whose valid time is represented using the strdf:period datatype. It also contains 1,076,901 triples without valid times. Using this dataset, we performed temporal and spatial stSPARQL queries, similar to the ones provided in Section 3 as examples.

Our experiments were conducted on an Intel Xeon E5620 with 12MB L3 cache running at 2.4 GHz.
The system has 24GB of RAM, 4 disks of striped RAID (level 5) and the operating system installed is Ubuntu. We ran our queries three times on cold and warm caches; for the warm-cache measurements, we ran each query once before measuring the response time. We compare our system with the following implementations.

The Prolog-based implementation of AnQL. We disabled the inferencer and we followed the data model and the query language that is used in [15], e.g., the above quad is transformed into the following AnQL statement:

congress_people:a politico:hasrole _:node1 :[ , ]

AllegroGraph. AllegroGraph offers a set of temporal primitives and temporal functions, extending their Prolog query engine, to represent and query temporal information in RDF. AllegroGraph does not provide any high level syntax to annotate triples with their valid time, so, for example, the GovTrack triple that we presented earlier was converted into the following graph:

congress_people:a politico:hasrole _:node1 graph: t....
graph: t... allegro:starttime " T00:00:00"^^xsd:dateTime.
graph: t... allegro:endtime " T00:00:00"^^xsd:dateTime.

As AllegroGraph supports the N-Quads format, we stored each triple of the dataset in a named graph, by assigning a unique URI to each valid time. Then, we described the beginning and ending times of the period that the named graph corresponds to, using RDF statements with the specific temporal predicates that are defined in AllegroGraph. We used the AllegroGraph Free server edition that allows us to store up to five million statements, so we could not store the full version of the dataset.

Naive implementation. We developed a baseline implementation by extending the Sesame native store with the named graph translators we use in Strabon so that it can store stRDF graphs and query them using stSPARQL queries. We also developed in Java the temporal extension functions that are used in the benchmarks. A similar implementation has been used as a baseline in [14] where we evaluated the geospatial features of Strabon.

We evaluate the performance of the systems in terms of query response time. We compute the response time for each query posed by measuring the elapsed time from query submission until a complete iteration over the results has been completed. We also investigate the scalability with respect to database size and complexity of queries. We have conducted four experiments that are explained below. Twenty queries were used in the evaluation. Only two queries are shown here; the rest are omitted due to space considerations.
However, all datasets and the queries that we used in our experimental evaluation are publicly available.

Experiment 1. In this experiment we ran the same query against a number of subsets of the GovTrack dataset of various sizes, as we wanted to test the scalability of all systems with respect to the dataset size. To achieve this, we created five instances of the GovTrack dataset, each one with an exponentially increasing number of triples and quads. The query that is evaluated against these datasets is shown in Figure 2. Figure 3(a) shows the results of this experiment. As the dataset size increases, more periods need to be processed and the query response time grows for all systems. This is expected, as posing queries against a large dataset is challenging for memory-based implementations. Interestingly, the AnQL response time decreases in query Q2, where a temporal filter is added to the temporal graph pattern of query Q1. The use of a very selective temporal

filter reduces the number of the intermediate results. Also, the implementation of AnQL performs better in workloads of up to 100,000 triples and quads, as it is a memory-based implementation. The poor performance of the baseline implementation compared to Strabon is reasonable, as Strabon evaluates the temporal extension functions at the RDBMS level using the respective functions of PostgreSQL Temporal and a GiST index on period values, while in the case of the baseline implementation a full scan over all literals is required. AllegroGraph is not charged with the cost of processing the high level syntax for querying the valid time of triples, like the other implementations, therefore it stores two triples to describe each interval of the dataset. This is one of the reasons that it gets outperformed by all other implementations. One can observe that Strabon achieves better scalability in large datasets than the other systems due to the reasons explained earlier. The results when the caches are warm are far better, as the intermediate results fit in main memory, so we have fewer I/O requests.

stSPARQL:
  SELECT DISTINCT ?x ?name
  WHERE { ?x gov:hasrole ?term ?t .
          OPTIONAL { ?x foaf:name ?name . }
          FILTER(strdf:after(?t, [...]^^strdf:period)) }

AnQL:
  SELECT DISTINCT ?x ?name
  WHERE { ?x gov:hasrole ?term ?t .
          OPTIONAL { ?x foaf:name ?name . }
          FILTER(beforeany([[...]], ?t)) }

AllegroGraph:
  (select0-distinct (?x ?name)
    (q ?x !gov:hasrole ?term ?t)
    (optional (q ?x !foaf:name ?name))
    (interval-after-datetimes ?t ... ))

Fig. 2. Query of Experiment 1.

Experiment 2. We carried out this experiment to measure the scalability of all systems with respect to queries of varying complexity. The complexity of a query depends on the number and the type of the graph patterns it contains and their selectivity. We posed a set of queries against the GovTrack dataset and we increased the number of triple patterns in each query. As explained earlier, the AllegroGraph repository contains five million statements.
First, in Q2, we have a temporal triple pattern and a temporal selection on its valid time. Then, Q3 is formed by adding a temporal join to Q2. Then Q4 and Q5 are formed by adding some more graph patterns of low selectivity to Q3. Queries with low selectivity match large graphs of the dataset and as a result the response time increases. This happens mainly because in most cases the intermediate results do not fit in the main memory blocks that are available, requiring more I/O requests. In the queries Q6 and Q7 we added graph patterns with high selectivity to the previous ones and the response time decreased. This happened because of the highly selective graph patterns used. The respective response times in warm caches are far better, as expected. What is interesting in this case is that while in cold caches the response time slightly increases from query Q6 to query Q7, in warm caches it decreases. This happens because with warm caches the computational effort is greatly reduced and the response time depends more on the number of intermediate results that are produced. The query Q7 produces fewer intermediate results because it is more selective than Q6. AllegroGraph has the best performance in Q2, which contains only a temporal triple pattern, but when temporal functions are introduced (queries Q3-Q7), it performs worse than any other implementation. Obviously, the evaluation of a temporal join is very costly, as it internally

maps the variables that take part in the temporal join to the respective intervals of the dataset, retrieves their beginning and ending timestamps and then evaluates the temporal operators. The AnQL implementation performs very well in queries of low selectivity but in queries of high selectivity it is outperformed by the baseline implementation. Strabon, even with cold caches, performs significantly better than the other implementations due to the efficient evaluation of the queries at the database level and the use of a temporal index.

Fig. 3. (a) Experiment 1: Query response time with respect to dataset size. (b), (c), (d) Experiments 2, 3, 4: Query response time in milliseconds.
[Figure placeholder: panel (a) plots time in milliseconds against the number of triples and quads for Strabon (warm cache), Strabon (cold cache), Naive, AnQL and AllegroGraph; table (b) covers queries Q1-Q7 for the same five systems; tables (c) and (d) cover queries Q8-Q16 and Q17-Q20 for Strabon (warm caches), Strabon (cold caches) and Naive.]

Experiment 3. In this experiment we posed temporal queries against the GovTrack dataset in order to test the performance of different temporal operators in the FILTER clause of the query that are typically used to express a temporal join. The triple patterns in the queries posed (Q8-Q16) are identical so the queries differ only in the temporal function used in the FILTER clause of the query. For example, query Q8 is the following:

SELECT DISTINCT ?x1 ?x2
WHERE { ?x1 gov:hasrole ?term ?t1 .
        ?x2 gov:hasrole ?term ?t2 .
        FILTER(strdf:during(?t1, ?t2)) }

The results of the experiment are shown in the table of Figure 3(c). For each system, the differences in performance with respect to the different temporal operators used in queries are minor, especially in the case of the naive implementation.
As expected, Strabon continues to perform much better than the naive implementation as the implementation of each operator is more efficient. Experiment 4. In this experiment we evaluate the spatiotemporal capabilities of Strabon and the baseline implementation. We used the CORINE Land Cover changes dataset. This is a spatiotemporal dataset that contains more temporal triples, but there are only two distinct valid time values. Query Q17 retrieves the valid times of the temporal triples, while query Q18 is more selective

and performs a temporal join. Query Q19 is similar to Q20 but it also retrieves geospatial information so the response time is increased. Query Q20 performs a temporal join and a spatial selection, so the response time is increased for both systems. Strabon performs better because the temporal and the spatial operations are evaluated at the database level and the respective indices are used, while in the naive implementation these functions are implemented in Java.

6 Related Work

To the best of our knowledge, the only commercial RDF store that has good support for time is AllegroGraph. AllegroGraph allows the introduction of points and intervals as resources in an RDF graph and their situation on a time line (by connecting them to dates). It also offers a rich set of predicates that can be used to query temporal RDF graphs in Prolog. As in stSPARQL, these predicates include all qualitative relations of [16] involving points and intervals. Therefore, all the temporal queries expressed using Prolog in AllegroGraph can also be expressed by stSPARQL in Strabon.

In [7] another approach is presented for extending RDF with temporal features, using a temporal element that captures more than one time dimension. A temporal extension of SPARQL, named T-SPARQL, is also proposed which is based on TSQL2. Also, [17] presents a logic-based approach for extending RDF and OWL with valid time and the query language SPARQL for querying and reasoning with RDF, RDFS and OWL2 temporal graphs. To the best of our knowledge, no public implementation of [7] and [17] exists that we could use to compare with Strabon. Similarly, the implementations of [19] and [22] are not publicly available, so they could not be included in our comparison.

In stRDF we have not considered transaction time since the applications that motivated our work required only user-defined time and valid time of triples. The introduction of transaction time to stRDF would result in a much richer data model.
We would be able to model not just the history of an application domain, but also the system's knowledge of this history. In the past the relevant rich semantic notions were studied in TSQL2 [20], Telos (which is very close to RDF) [18] and temporal deductive databases [21].

7 Conclusions

In future work, we plan to evaluate the valid time functionalities of Strabon on larger datasets, and continue the experimental comparison with AllegroGraph as soon as we obtain a license of its Enterprise edition. We will also study optimization techniques that can increase the scalability of Strabon. Finally, it would be interesting to define and implement an extension of stSPARQL that offers the ability to represent and reason with qualitative temporal relations in the same way that the Topology vocabulary extension of GeoSPARQL represents topological relations.

References

1. Open Geospatial Consortium: OGC GeoSPARQL - A geographic query language for RDF data. OGC Candidate Implementation Standard (2012)
2. Al-Kateb, M., Ghazal, A., Crolotte, A., Bhashyam, R., Chimanchode, J., Pakala, S.P.: Temporal Query Processing in Teradata. In: ICDT (2013)
3. Allen, J.F.: Maintaining knowledge about temporal intervals. CACM 26(11) (1983)
4. Böhlen, M.H., Snodgrass, R.T., Soo, M.D.: Coalescing in Temporal Databases. IEEE CS 19 (1996)
5. Clifford, J., Dyreson, C., Isakowitz, T., Jensen, C.S., Snodgrass, R.T.: On the semantics of now in databases. ACM TODS 22(2) (1997)
6. Date, C.J., Darwen, H., Lorentzos, N.A.: Temporal data and the relational model. Elsevier (2002)
7. Grandi, F.: T-SPARQL: a TSQL2-like temporal query language for RDF. In: International Workshop on Querying Graph Structured Data (2010)
8. Gutierrez, C., Hurtado, C., Vaisman, R.: Temporal RDF. In: Gómez-Pérez, A., Euzenat, J. (eds.) ESWC. LNCS, vol. 3532. Springer (2005)
9. Gutierrez, C., Hurtado, C.A., Vaisman, A.: Introducing Time into RDF. IEEE TKDE 19(2) (2007)
10. Güting, R.H., Böhlen, M.H., Erwig, M., Jensen, C.S., Lorentzos, N.A., Schneider, M., Vazirgiannis, M.: A foundation for representing and querying moving objects. ACM TODS 25(1), 1-42 (2000)
11. Hurtado, C.A., Vaisman, A.A.: Reasoning with Temporal Constraints in RDF. In: Alferes, J., Bailey, J., May, W., Schwertel, U. (eds.) PPSWR. LNCS, vol. 4187. Springer (2006)
12. Koubarakis, M., Karpathiotakis, M., Kyzirakos, K., Nikolaou, C., Sioutis, M.: Data Models and Query Languages for Linked Geospatial Data. In: Eiter, T., Krennwallner, T. (eds.) RR. LNCS, vol. 7487. Springer (2012)
13. Koubarakis, M., Kyzirakos, K.: Modeling and Querying Metadata in the Semantic Sensor Web: The Model stRDF and the Query Language stSPARQL. In: Aroyo, L., et al. (eds.) ESWC. LNCS, vol. 6088. Springer (2010)
14. Kyzirakos, K., Karpathiotakis, M., Koubarakis, M.: Strabon: A Semantic Geospatial DBMS.
In: Cudré-Mauroux, P., et al. (eds.) ISWC. LNCS, vol. 7649. Springer (2012)
15. Lopes, N., Polleres, A., Straccia, U., Zimmermann, A.: AnQL: SPARQLing Up Annotated RDFS. In: Patel-Schneider, P., et al. (eds.) ISWC. LNCS. Springer (2010)
16. Meiri, I.: Combining qualitative and quantitative constraints in temporal reasoning. Artificial Intelligence 87(1-2) (1996)
17. Motik, B.: Representing and Querying Validity Time in RDF and OWL: A Logic-Based Approach. Journal of Web Semantics 12-13, 3-21 (2012)
18. Mylopoulos, J., Borgida, A., Jarke, M., Koubarakis, M.: Telos: representing knowledge about information systems. ACM TIS (1990)
19. Perry, M.: A Framework to Support Spatial, Temporal and Thematic Analytics over Semantic Web Data. Ph.D. thesis, Wright State University (2008)
20. Snodgrass, R.T. (ed.): The TSQL2 Temporal Query Language. Springer (1995)
21. Sripada, S.M.: A logical framework for temporal deductive databases. In: Bancilhon, F., DeWitt, D. (eds.) VLDB. M. Kaufmann Publ. Inc. (1988)
22. Tappolet, J., Bernstein, A.: Applied Temporal RDF: Efficient Temporal Querying of RDF Data with SPARQL. In: Aroyo, L., et al. (eds.) ESWC. LNCS, vol. 5554. Springer-Verlag (2009)

4 Additional Implementations

In this chapter we describe tools that have been developed in the context of TELEIOS but were not included in the Description of Work. These tools are the following:

The visual query builder for stSPARQL.
Sextant, a web-based tool for exploring the ocean of linked geospatial data, and its temporally enabled extension, called SexTant, which extends the previous version with the capability of visualizing geospatial features with a temporal dimension on both a map and a timeline.

Sextant has been reported in the demo paper "Browsing and Mapping the Ocean of Linked Geospatial Data" that was presented at ESWC 2013 and was the co-winner of the best demo award. SexTant has been reported in the demo paper "Visualizing Time-Evolving Linked Geospatial Data" and will be demonstrated at ISWC.

4.1 The Visual Query Builder

Linked geospatial data has received increased attention both from researchers and practitioners. At the same time, systems have been developed that store and manage this data, like the RDF store Strabon. However, the range of users that can take advantage of these systems is limited because using them requires expert knowledge of a query language like SPARQL. This might prevent users without such knowledge from using the data stored in these systems although it would be beneficial for them. In the scope of the TELEIOS project, we designed and developed a tool that offers intuitive editing functionalities for constructing complex queries in stSPARQL through a graphical user interface. This tool, called the Visual Query Builder, demands no detailed knowledge of a query language, but at the same time does not prevent more experienced users from writing queries directly in stSPARQL. The Query Builder provides users with two choices: to construct a query using a graphical user interface through the Facet-Graph based Query Editor, or to write a query in stSPARQL through the text based Query Editor.
The tool has the ability to convert a graph based query into an stSPARQL query. This gives users who have some experience with SPARQL, but not enough confidence to write a query directly in stSPARQL, the freedom to modify their query in case they cannot express exactly what they want through the Facet-Graph based Query Editor.

4.2 Functionality Overview

The query options of the Facet-Graph based Editor are derived from one or more ontologies that the Query Builder takes as input and that describe the semantics and the structure of the queried data sets. Based on these options, the user creates a query in the form of a graph. Afterwards, if the user decides to send the constructed query to the specified endpoint of Strabon in order to get back the corresponding results, the Query Builder transforms this graph query into an actual stSPARQL query. This query is sent to Strabon, where it is processed.

4.3 Architecture and Design

The Query Builder is designed as a Java based Web Application that consists of two main components, the Query Builder UI and the Query Builder Server, which are described in detail in the

52 TELEIOS FP following two subsections. The system overview of the Query Builder is shown in Figure 1. There are not all of the components implemented yet as some of them are planned as future work and are described in the last section. The two main functionalities of the Query builder can be summarized as follows: Construction of complex query graphs through a graphical user interface, which represent stsparql queries Transformation of the complex query graphs into the corresponding stsparql queries However, these functionalities are based and built upon a more basic and abstract design. The core design of the Query Builder consists of an abstract model that offers to the tool the ability to become easily expandable and adaptable to other query languages and database types. This model describes, through interfaces and services, in an abstract and general manner how a component for a specific query language, a specific database type or the conversion between a graph-query and a text-query should be implemented. This is facilitated by the use of the Eclipse Equinox OSGi container, which allows us to structure the system as a set of modular Java OSGi bundles on top of it. Each component is an OSGi bundle and exposes its functionality through OSGi and Spring services, which make easy the change of the system components, if there is a need for a different implementation. 4.4 Query Builder UI The Query Builder UI is the JavaScript based front-end part of the Query Builder. Though it is jointly developed in Java along with the Query Builder Server, it is mostly developed with the use of the Vaadin Web application framework. Vaadin uses Java as the programming language, the same language we used for the implementation of our server. The web content is rendered by the use of the Google Web Toolkit, which translates the Java code in Javascript. 
Thus, Vaadin offered us the advantage of using Java to produce a rich Internet application on the client/browser side. The Query Builder UI is intended to be compatible with any up-to-date standard Web browser without the need for additional browser plugins (e.g. Adobe Flash).

The UI options that are provided for query construction are derived from an ontology (e.g. the domain ontologies developed for the NOA and DLR Use Cases) which describes the semantics of the content of the data sets to which the constructed query will refer. Thus the Query Builder implementation itself makes only few assumptions concerning the structure of the queried data. The benefit of this approach is that the Query Builder can be adapted to different Use Cases by swapping the underlying domain ontologies.

In order to allow different kinds of users to construct stSPARQL queries, the Query Builder offers two different query editors. The Facet-Graph-based Query Editor addresses users who are not familiar with stSPARQL, while the Text-based Query Editor is useful for users who are already experienced with SPARQL. As an example, in Figures 4.5 and 4.6 we provide a query that requests patches in the area of Piraeus which are labeled with the category class Water (i.e. patches that contain water bodies).

The Facet-Graph-based Query Editor, shown in Figure 4.5, enables users without knowledge of stSPARQL to express complex queries as a graph. The vertices of this graph are FacetBoxes, which refer either to specific classes defined in the underlying ontologies or to concrete geospatial features, such as polygons or line strings, that can be defined by the user. FacetBoxes also enable the user to specify simple constraints on the range of various properties (mostly DataProperties) of the referenced class. FacetBoxes are connected by PropertyEdges that represent Object or Data Properties defined in the underlying ontologies.

The Text-based Query Editor, shown in Figure 4.6, can be used by users with a more expert background in SPARQL either to directly create custom stSPARQL queries or to adapt queries that have previously been created in the Facet-Graph-based Query Editor and converted to stSPARQL.

4.5 Query Builder Server

The Query Builder Server is structured as a set of modular Java OSGi bundles on top of the Eclipse Equinox OSGi container. Equinox is the Eclipse implementation of the OSGi framework specifications: a module runtime that allows developers to implement an application as a set of bundles using common services and infrastructure, something that does not exist in standalone Java environments. This allowed us to construct our system as a set of such bundles and to expose the functionality of each bundle through services. We can therefore easily swap these bundles/components to give our system different functionality, or add new features, in order to adapt or integrate the Query Builder system to changed or entirely different use cases. The most significant subcomponents of the Query Builder Server are the following:

- OWL Model Service
- Query Model Transformation
- Storage Service

The OWL Model Service serves as a data source for the model definition, which is obtained from ontologies that are either retrieved from the local filesystem or by connecting to a remote repository (e.g. Strabon).
The OWL Model Service can be used as a data access object to retrieve individual Properties and Classes of the currently used Model. While the Model itself is constructed out of a set of OWL-based ontologies, the model service encapsulates the access to them through more generalized Java interfaces that are independent of the specifics of OWL-based model descriptions. Along with the ontologies that the tool takes as input, there is also an internal ontology, which describes the spatial constraints that stSPARQL supports and enables the tool to express spatial queries.

The Query Model Transformation component transforms the Facet-Graph based query representations into query strings formulated in a certain query language (e.g. stSPARQL). The generated query strings can then be sent to an endpoint on top of a spatially enabled RDF store (e.g. Strabon) to retrieve the results.

The Storage Service provides an interface for retrieving and persisting query representations. Its implementation relies on an embedded Neo4j graph database, which can store arbitrary property graphs without requiring a rigid schema for the stored data and without inter-process communication through a separate protocol or query language. It can therefore be used as a basis for an easy-to-implement object storage.
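To make the Query Model Transformation step more concrete, the following is a minimal Python sketch of how a graph of FacetBoxes and PropertyEdges could be turned into a query string. It is an illustration under stated assumptions, not the Query Builder's actual Java implementation: the names QueryGraph, QueryTransformer, StSparqlTransformer and TRANSFORMERS, the `ex:` vocabulary, the polygon coordinates and the `strdf:WKT` datatype suffix are all hypothetical, and a plain dictionary stands in for the OSGi service lookup. Prefix declarations are omitted, so the output is only the body of a query.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol


@dataclass
class FacetBox:
    """Graph vertex: a query variable, optionally typed by an ontology class."""
    var: str                          # variable name without the leading '?'
    class_name: Optional[str] = None  # prefixed class name; None for literals/geometries
    constraints: List[str] = field(default_factory=list)  # FILTER expressions


@dataclass
class PropertyEdge:
    """Graph edge: an object or data property linking two FacetBoxes."""
    source: str
    property_name: str
    target: str


@dataclass
class QueryGraph:
    boxes: List[FacetBox]
    edges: List[PropertyEdge]


class QueryTransformer(Protocol):
    """Abstract contract; one implementation per target query language."""
    def transform(self, graph: QueryGraph) -> str: ...


class StSparqlTransformer:
    """Hypothetical transformer that emits a SELECT query (no PREFIX lines)."""
    def transform(self, graph: QueryGraph) -> str:
        variables = " ".join(f"?{box.var}" for box in graph.boxes)
        lines = [f"SELECT DISTINCT {variables}", "WHERE {"]
        # One rdf:type triple per class-typed FacetBox.
        for box in graph.boxes:
            if box.class_name is not None:
                lines.append(f"  ?{box.var} rdf:type {box.class_name} .")
        # One triple pattern per PropertyEdge.
        for edge in graph.edges:
            lines.append(f"  ?{edge.source} {edge.property_name} ?{edge.target} .")
        # Constraints attached to FacetBoxes become FILTER clauses.
        for box in graph.boxes:
            lines.extend(f"  FILTER({c})" for c in box.constraints)
        lines.append("}")
        return "\n".join(lines)


# A plain dict stands in for the OSGi service registry of the real system.
TRANSFORMERS: Dict[str, QueryTransformer] = {"stsparql": StSparqlTransformer()}

# Hypothetical query: patches labelled as water that intersect a user-drawn polygon.
graph = QueryGraph(
    boxes=[
        FacetBox("patch", "ex:Patch"),
        FacetBox("label", "ex:Water"),
        FacetBox("geom", constraints=[
            'strdf:mbbIntersects(?geom, "POLYGON((23.6 37.9, 23.7 37.9, '
            '23.7 38.0, 23.6 38.0, 23.6 37.9))"^^strdf:WKT)'
        ]),
    ],
    edges=[
        PropertyEdge("patch", "ex:hasLabel", "label"),
        PropertyEdge("patch", "ex:hasGeometry", "geom"),
    ],
)
query = TRANSFORMERS["stsparql"].transform(graph)
print(query)
```

Keeping the transformer behind a small interface mirrors the design goal stated above: supporting another query language would only require registering a second implementation under a different key.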

4.6 Future work

The Visual Query Builder enables users who are inexperienced with SPARQL to create stSPARQL queries using boxes and arrows and to convert them to valid query strings. However, as shown in Figure 3, the tool is designed to provide more options to its users. Future work includes the implementation of a User Management / Access Control subcomponent, which will serve as a data source for user accounts and groups. This will enable the system to associate saved query models with the users and groups they belong to. In addition, this subcomponent will evaluate user credentials and manage access to specific features of the Query Builder system (e.g. loading, editing and saving query representations) in the context of the logged-in user. The storage of the user and group data could be realized through an external LDAP server.

Furthermore, the Query Builder UI can be extended to provide a JavaScript API for client-side integration with other HMI components, through an extensible JavaScript prototype object that will allow bidirectional, event-driven communication between the Query Builder UI and other HMI components. Thus, queries can be edited inside the Query Builder while changes to the actual query string are reported instantly to other components, which can then update their appearance by retrieving and processing the result set of the changed query.

Sextant: Browsing and Mapping the Ocean of Linked Geospatial Data

Charalampos Nikolaou, Kallirroi Dogani, Kostis Kyzirakos, and Manolis Koubarakis

National and Kapodistrian University of Athens, Greece

Abstract. Linked geospatial data has recently received attention as researchers and practitioners have started tapping the wealth of geospatial information available on the Web. With the rapid population of the Web of data with geospatial information, applications to manage it have also started to emerge. What the semantic geospatial web lacks, though, compared to the technological arsenal of the traditional GIS area are the tools that aid researchers and practitioners in making use of this ocean of geospatial data. In this demo paper, we present Sextant, a web-based tool that enables exploration of linked geospatial data as well as creation, sharing, and collaborative editing of thematic maps by combining linked geospatial data and other geospatial information available in standard OGC file formats.

1 Introduction and Motivation

Linked geospatial data has recently received attention as researchers and practitioners have started tapping the wealth of geospatial information available on the Web [4]. As a result, in the last few years, the Web of data is being rapidly populated with geospatial information. Ordnance Survey is the first national mapping agency that has made various kinds of geospatial data from Great Britain available as open linked data. Another representative example of such efforts is LinkedGeoData, where OpenStreetMap (OSM) data are made available as RDF and queried using the declarative query language SPARQL [1]. A similar effort is GeoLinked Data, where geospatial data from Spain is made public using RDF [6]. Finally, the Greek Linked Open Data portal that is being developed by our group has recently published a number of open datasets for the area of Greece as linked data.
Applications for exploiting this abundance of geospatial information have also started to emerge. In project TELEIOS we have built a service for real-time fire monitoring using semantic web and linked data technologies [3]. The core component of this service is the geospatial RDF store Strabon [5], which stores hotspot products extracted from satellite images together with linked geospatial data offered by the Greek Linked Open Data portal mentioned previously, and queries them to deliver the service. Strabon supports two query languages: stSPARQL, an extension of SPARQL 1.1 for querying linked geospatial data that changes over time [5], and GeoSPARQL [7], a recent OGC standard for static geospatial data.

This work was supported by the European FP7 project TELEIOS (257662) and the Greek NSRF project SWeFS (180).

What the semantic geospatial web lacks, though, compared to the technological arsenal of the traditional GIS area are the tools that aid researchers and practitioners in making use of this ocean of geospatial data. Although tools for exploring the content of linked geospatial data have made their appearance recently, these tools focus on browsing a single dataset or the content of a single SPARQL endpoint. The LinkedGeoData browser, for example, being the most promising among these tools, offers browsing functionality for OSM data and also editing capabilities in the style of the original OSM portal. Map4RDF is another browser for RDF data with geospatial information, which also supports the visualization of statistical data modeled according to the SCOVO vocabulary. Although such tools are very useful for exploring the data offered by a single SPARQL endpoint, they present severe limitations when faced with the task of exploring the linked geospatial data cloud. The recent tool LODVisualization, which is based on the Linked Data Visualization Model [2] for visualizing RDF data, attempts to address this problem in its general form (not only in the geospatial domain). Although LODVisualization seems a very promising tool, it has very limited support for the visualization of geospatial data and the construction of meaningful thematic maps. With the aim of filling this gap and going beyond data exploration to map creation and sharing, we designed and developed Sextant.
Similar to well-known GIS tools (e.g., ArcGIS, QGIS), Sextant can be used to produce thematic maps by layering geospatial information which exists in a number of data sources ranging from standard SPARQL endpoints, to SPARQL endpoints following OGC standards for the modeling and querying of geospatial information (i.e., GeoSPARQL), or even other standards or well-adopted geospatial file formats, such as KML and GeoJSON. The feature that distinguishes Sextant from other semantic web or GIS tools is that map creation and sharing, as well as exploration of data, can be done in a declarative way using the query languages stSPARQL or GeoSPARQL. In this sense, Sextant is able to create useful thematic maps by layering information coming from the evaluation of SPARQL queries.

2 Sextant Overview

Figure 1a gives a high-level overview of Sextant. The two main functionalities of Sextant can be summarized as follows:

- Exploration of linked geospatial data spanning multiple SPARQL endpoints.
- Creation, sharing, and collaborative editing of thematic maps by combining linked geospatial data and other geospatial information available in vector or raster formats, such as KML, GeoJSON, and GeoTIFF.

Fig. 1: (a) Sextant overview and (b) Map ontology

Sextant delivers the above functionality by modeling the content of a map according to the Map ontology of Figure 1b. In terms of exploration, Sextant harvests the content of a SPARQL endpoint to build a class hierarchy and discover the spatial extent of available information (assuming that the endpoint supports either stSPARQL or GeoSPARQL). According to the Map ontology, a map comprises an ordered set of layers, the content of which may derive either from the evaluation of an stSPARQL/GeoSPARQL query on a SPARQL endpoint or directly from a standard file format for representing geospatial information, such as KML. A layer may also derive from the evaluation of standard SPARQL queries using other vocabularies such as the W3C GeoXG and NeoGeo (the W3C GeoXG incubator group has also developed a geospatial ontology). In such cases, however, one has to provide an adapter for transforming the representation of a geometry to a standard OGC representation format, such as Well-Known Text or GML, the two OGC geospatial data formats supported by stSPARQL and GeoSPARQL. Population of the Map ontology results in the production of a map as a web resource which can be shared with others for collaborative editing and viewing in Sextant. Collaboration can also be achieved using well-known desktop or web-based tools (QGIS, ArcGIS, Google Maps, OpenLayers) by leveraging the export facility of Sextant.

The main design goals in the development of Sextant have been flexibility, portability, and interoperability. Sextant achieves these goals based on the following technologies. It has been developed in the Google Web Toolkit framework, which is a mature framework that provides rapid cross-platform and cross-browser implementations of web applications. It is also based on OGC standards for the representation and querying of geospatial information (GeoSPARQL, KML, WKT), thus making it easily interoperable with well-established and mature GIS applications.

3 Demonstration Overview

To demonstrate Sextant, we will present how one could explore and then create a map using linked geospatial data spanning multiple SPARQL endpoints and KML files. The demonstration will be based on a real user scenario in which an expert of an emergency response agency like the Center of Satellite Based Crisis Information (ZKI) or the National Observatory of Athens (NOA) would like to quickly compile a map in response to an emergency event, for example a flood, a tsunami, an earthquake, or a forest fire. A typical map product that ZKI produced on August 25, 2009 for assessing the area burnt as a result of the fires of the period from August 21 to August 24, 2009 in Attica, Greece is presented in Figure 2a. Such a map, apart from depicting burnt areas, is annotated with other kinds of information to aid experts in assessing the severity of the fire event. Typical information includes the road network, land use/cover, toponyms, population, and other more detailed information concerning map production and the data sources that were used.

Fig. 2: Fire products of 2009 in the prefecture Attica, Athens, Greece: (a) ZKI, (b) Sextant

We will use Sextant to produce the map presented in Figure 2b, which is similar to the one produced by ZKI above. The major difference between the two maps is that the one produced by Sextant is based solely on Linked Open Geospatial Data that are publicly available on the web. This will allow an emergency response manager to quickly compile a map based on open source intelligence until more precise, detailed data becomes available, i.e., after contacting local authorities. Sextant will communicate with three endpoints on top of well-known spatially-enabled RDF stores (Strabon, Parliament, and Virtuoso) and also with KML files that are publicly available on the web to get most of the information mentioned above. To demonstrate the collaborative editing capabilities of Sextant, we will share the URI corresponding to the produced map with another user who will be able to edit the map. The changes will be reflected back to the creator of the map.

4 Conclusions and Future Work

In this work, we presented Sextant, a web-based tool that offers functionalities which are of fundamental importance to leveraging linked geospatial data: a) exploration of linked geospatial data that span multiple SPARQL endpoints, b) creation, sharing, and collaborative editing of thematic maps produced by querying the linked geospatial data cloud and other geospatial data sources available in OGC standard file formats, c) export of maps in OGC standard formats for viewing and editing using other GIS applications.

In the future, we plan to extend Sextant in two directions. First, we will enable the creation of a single layer of a map by combining geospatial information present in different SPARQL endpoints. Such a feature would turn Sextant into a very powerful and desirable tool for leveraging the linked geospatial data cloud. In principle, such functionality could be easily realized if an RDF store that is both spatially enabled and federated existed. Second, we will extend map rendering, which currently uses Google Maps, to use the more versatile OpenLayers project.

References

1. Auer, S., Lehmann, J., Hellmann, S.: LinkedGeoData: Adding a Spatial Dimension to the Web of Data. In: International Semantic Web Conference (2009)
2. Fernández, J.M.B., Auer, S., Garcia, R.: The linked data visualization model. In: International Semantic Web Conference (Posters & Demos) (2012)
3. Koubarakis, M., et al.: Real-time wildfire monitoring using scientific database and linked data technologies. In: 16th International Conference on Extending Database Technology (EDBT 2013). Genoa, Italy (March 2013)
4. Koubarakis, M., Karpathiotakis, M., Kyzirakos, K., Nikolaou, C., Sioutis, M.: Data Models and Query Languages for Linked Geospatial Data. Invited papers from 8th Reasoning Web Summer School (2012)
5. Kyzirakos, K., Karpathiotakis, M., Koubarakis, M.: Strabon: A Semantic Geospatial DBMS. In: 11th International Semantic Web Conference (2012)
6. León, A.d., Saquicela, V., Vilches, L.M., Villazón-Terrazas, B., Priyatna, F., Corcho, O.: Geographical Linked Data: a Spanish Use Case. In: I-SEMANTICS. ACM (2010)
7. Open Geospatial Consortium: OGC GeoSPARQL - A geographic query language for RDF data. OGC Implementation Standard (2012)

SexTant: Visualizing Time-Evolving Linked Geospatial Data

Konstantina Bereta (1), Charalampos Nikolaou (1), Manos Karpathiotakis (2), Kostis Kyzirakos (1), and Manolis Koubarakis (1)

(1) National and Kapodistrian University of Athens, Greece
(2) École Polytechnique Fédérale de Lausanne, Switzerland

Abstract. We present SexTant, a Web-based system for the visualization and exploration of time-evolving linked geospatial data and the creation, sharing, and collaborative editing of temporally-enriched thematic maps which are produced by combining different sources of such data.

1 Introduction and Motivation

Linked geospatial data has recently received attention as researchers and practitioners have started tapping the wealth of geospatial information available in the archives of various national cartographic agencies and making it available on the Web as linked data [2]. As a result, in the last few years, the Web of data is being rapidly populated with geospatial information. As the real-world entities represented in linked geospatial datasets evolve over time, the datasets themselves get updated and both the spatial and the temporal dimension of data become significant for users.

In the demo paper [4] we presented Sextant, a tool that enables the visualization and exploration of the spatial dimension of linked geospatial data. Sextant enables map creation and sharing, as well as the visualization and exploration of data by evaluating GeoSPARQL queries on SPARQL endpoints. In this way rich thematic maps can be created by layering information coming from the evaluation of GeoSPARQL queries. Sextant is based on standards defined by the Open Geospatial Consortium (OGC), thus it is interoperable with other well-known GIS and Web tools such as Google Earth.

In this demo paper we turn our attention to the temporal dimension of linked geospatial data and present a new version of Sextant that we now rename SexTant (the capital T in the new name emphasizes the time dimension).
SexTant extends the functionalities of the earlier tool by visualizing the temporal dimension of data having a spatial extent simultaneously on a map and a timeline. The new capabilities of SexTant build on the temporal features of the data model stRDF, the query language stSPARQL, and their efficient implementation in the geospatial RDF store Strabon [1]. stRDF and stSPARQL go beyond the OGC standard GeoSPARQL by allowing the representation and querying of linked geospatial data that changes over time [1,3]. SexTant (and this demo paper) extends the research presented in [1] by demonstrating how graphs defined in stRDF/stSPARQL can be explored and visualized.

In related work, the French project GEOPEUPLE is also studying the modeling and visualization of spatiotemporal data. As an example, a demo of that project visualizes the evolution of administrative regions in France over time.

This work was supported by the EU FP7 project TELEIOS (257662), the Greek NSRF project SWeFS (180), and the EU project Optique (318338).

2 New Functionalities of SexTant

SexTant extends the architecture of our earlier system presented in [4] and shown in Fig. 1a as follows (new and modified components are highlighted with pink boxes). First, apart from the map ontology used by the earlier system and shown in Fig. 1b, SexTant employs the temporal ontology dictated by stRDF and stSPARQL for the modeling of valid time [1]. This ontology enables the introduction of user-defined time and valid time of a triple in stRDF data. Times are modelled as instants or intervals and are represented using values of the datatypes xsd:dateTime and strdf:period respectively. Second, one can now use all the temporal features of stSPARQL to query linked spatiotemporal data encoded in stRDF. In this way the full capabilities of endpoints using the spatiotemporal RDF store Strabon can be exploited. Third, the module that translates the results of stSPARQL queries from XML to KML format has been extended so that the temporal primitives of stRDF mentioned above are translated into the respective temporal primitives of the KML standard. An example of this transformation is provided in Fig. 1c. Last, SexTant builds on the Timemap JavaScript library for visualizing temporally-enriched KML files.
This enables the visualization of geospatial features with associated temporal information on a map and a timeline simultaneously. Timemap has been transparently integrated into the implementation of the earlier system, which is based on the Google Web Toolkit framework.

3 Demonstration Overview

The demonstration of the spatio-temporal features of SexTant will be based on a real scenario in which an Earth Observation (EO) scientist studies the changes in the land cover of an area and assesses the damage caused by fires. This scenario is very common in the EO domain, where data is constantly produced by satellite sensors and is associated with metadata containing, among others, temporal attributes, such as the time that an image was acquired. Satellite acquisitions are utilized in related applications such as the CORINE Land Cover programme operated by the European Environment Agency, which makes available as a cartographic product the land cover of European areas over time.

To achieve the goal of our scenario, we will combine information derived from the following datasets that were produced within the project TELEIOS: the CORINE Land Cover dataset of year 2000, a Fire Hotspots dataset that provides information about fire hotspots in Greece, and a Burned Areas dataset that provides detailed information about areas of Greece that have been affected by fires during a recent fire season.

Fig. 1: (a) SexTant overview, (b) Map ontology, and (c) translation of SPARQL XML results to KML. Panel (c) shows a query, an excerpt of its SPARQL XML results, and the corresponding KML output ("..." marks values that are not legible in this copy).

  SPARQL query:

    SELECT DISTINCT ?area ?t
      (strdf:transform(?geometry, <http://www.opengis.net/def/crs/epsg/0/4326>) as ?geo)
    WHERE {
      ?area clc:haslanduse clc:sclerophyllousvegetation ?t .
      ?area clc:hasgeometry ?geometry .
      ?ba rdf:type noa:burnedarea ?t2 .
      ?ba noa:hasgeometry ?geometry2 .
      FILTER(strdf:mbbIntersects(?geometry, ?geometry2))
      FILTER(strdf:before(?t, ?t2))
    }

  SPARQL XML results (excerpt):

    <?xml version='1.0' encoding='utf-8'?>
    ...
    <result>
      <binding name='area'> <uri>...</uri> </binding>
      <binding name='geo'>
        <literal datatype='...gr/ontology#wkt'>POLYGON((21.821 38.283,21.821 38.282,...))</literal>
      </binding>
      <binding name='t'>
        <literal datatype='...gr/ontology#period'>[2000-01-01T00:00:00,2012-09-30T00:00:00)</literal>
      </binding>
    </result>
    ...

  KML file (excerpt):

    <kml xmlns='http://www.opengis.net/kml/2.2'>
      <Folder>
        <Placemark>
          <TimeSpan> <begin>...T00:00:00</begin> <end>...T00:00:00</end> </TimeSpan>
          <name>result0</name>
          <Polygon>
            <outerBoundaryIs>
              <LinearRing> <coordinates>21.821,38. ...</coordinates> </LinearRing>
            </outerBoundaryIs>
          </Polygon>
          <ExtendedData>
            <Data name='t'> <value>[...T00:00:00, ...T00:00:00)</value> </Data>
            <Data name='area'> <value>...gr/corinearea_34063</value> </Data>
          </ExtendedData>
        </Placemark>
      </Folder>
    </kml>
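The translation illustrated in Fig. 1c, where a period value becomes a KML TimeSpan and a WKT polygon becomes a KML Polygon, can be sketched in a few lines of Python. This is only an illustration of the idea and not SexTant's actual translation module: the function names, the example URI and the example ring coordinates are hypothetical, and the sketch assumes period literals of the form "[begin,end)" and single-ring POLYGON literals. The element order inside the Placemark follows the figure.

```python
import re
import xml.etree.ElementTree as ET
from typing import Tuple

KML_NS = "http://www.opengis.net/kml/2.2"


def parse_period(literal: str) -> Tuple[str, str]:
    """Split a period literal such as '[begin,end)' into its two endpoints."""
    match = re.fullmatch(
        r"\s*[\[(]\s*([^,\s]+)\s*,\s*([^,\s\])]+)\s*[\])]\s*", literal)
    if match is None:
        raise ValueError(f"not a period literal: {literal!r}")
    return match.group(1), match.group(2)


def wkt_polygon_to_kml_coords(wkt: str) -> str:
    """Turn 'POLYGON((x y, x y, ...))' into KML 'x,y x,y ...' (outer ring only)."""
    match = re.fullmatch(r"\s*POLYGON\s*\(\(([^()]*)\)\)\s*", wkt)
    if match is None:
        raise ValueError("this sketch handles single-ring POLYGON literals only")
    points = [p.split() for p in match.group(1).split(",") if p.strip()]
    return " ".join(",".join(point) for point in points)


def binding_to_placemark(name: str, area_uri: str, wkt: str, period: str) -> ET.Element:
    """Build one KML Placemark from the bindings of one query result."""
    begin, end = parse_period(period)
    placemark = ET.Element("Placemark")
    span = ET.SubElement(placemark, "TimeSpan")
    ET.SubElement(span, "begin").text = begin
    ET.SubElement(span, "end").text = end
    ET.SubElement(placemark, "name").text = name
    polygon = ET.SubElement(placemark, "Polygon")
    boundary = ET.SubElement(polygon, "outerBoundaryIs")
    ring = ET.SubElement(boundary, "LinearRing")
    ET.SubElement(ring, "coordinates").text = wkt_polygon_to_kml_coords(wkt)
    # Non-spatial bindings are kept as ExtendedData, as in the figure.
    extended = ET.SubElement(placemark, "ExtendedData")
    for key, value in (("t", period), ("area", area_uri)):
        data = ET.SubElement(extended, "Data", name=key)
        ET.SubElement(data, "value").text = value
    return placemark


# Hypothetical result row; the URI and the ring are made up for illustration.
placemark = binding_to_placemark(
    name="result0",
    area_uri="http://example.org/area/1",
    wkt="POLYGON((21.821 38.283, 21.821 38.282, 21.822 38.282, 21.821 38.283))",
    period="[2000-01-01T00:00:00,2012-09-30T00:00:00)",
)
kml = ET.Element("kml", xmlns=KML_NS)
ET.SubElement(kml, "Folder").append(placemark)
print(ET.tostring(kml, encoding="unicode"))
```

A file produced this way carries both the geometry and its valid time in each Placemark, which is the information a timeline-aware viewer needs in order to place a feature on the map and on the timeline at once.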
The EO scientist will use SexTant to visualize the results of stSPARQL queries that use several thematic, spatial and temporal criteria, so that she will be able to derive implicit links among the involved datasets due to their spatial and temporal correlation. In our scenario, we will first visualize on a map the areas that have been classified as sclerophyllous vegetation according to the CORINE Land Cover dataset. The valid time of the triples that encode information about these areas will be projected on the timeline. Next, a new layer that visualizes the hotspots that have been identified during the fire season will be displayed on the map, while the timeline will display the time when the hotspots were detected. Then, a new layer that depicts the areas that were burned during the forest fires of 2012 will be overlaid on the map and the timeline. The resulting map will display to the EO scientist the sclerophyllous forests that were burnt by the forest fires of 2012, along with a preview of the evolution of the forest fires as they were detected by satellites, so that she can assess the severity of the damage caused by fires. A similar procedure will be used in order to discover implicit links among the datasets enriched with provenance information, e.g., to discover the cause of changes in the land cover of areas as represented in CORINE through the visualization and overlay of the other datasets. Such implicit links can later on be asserted to enrich all datasets.

Fig. 2: A screenshot from SexTant depicting the evolution of the land cover

Layers that contain solely geospatial information will be retrieved by evaluating a GeoSPARQL query on a Strabon, Oracle, Parliament, or Virtuoso endpoint, while layers that contain spatial and temporal information will be retrieved by evaluating an stSPARQL query on Strabon. The reason for this choice is that stSPARQL is the only language that provides the spatial and temporal primitives that are needed for this scenario, while Strabon is currently the only temporally-enabled geospatial RDF store, as we have discussed in [1]. In this respect our demo will also serve to showcase the new functionalities of the system Strabon as presented in [1]. A video demonstration of SexTant that follows the scenario described in this section is available online.

References

1. Bereta, K., Smeros, P., Koubarakis, M.: Representation and querying of valid time of triples in linked geospatial data. In: ESWC. LNCS, vol. 7882 (2013)
2. Koubarakis, M., Karpathiotakis, M., Kyzirakos, K., Nikolaou, C., Sioutis, M.: Data Models and Query Languages for Linked Geospatial Data. LNCS, vol. 7487. Springer (2012)
3. Kyzirakos, K., Karpathiotakis, M., Koubarakis, M.: Strabon: A Semantic Geospatial DBMS. In: ISWC. LNCS, vol. 7649. Springer (2012)
4. Nikolaou, C., Dogani, K., Kyzirakos, K., Koubarakis, M.: Sextant: Browsing and Mapping the Ocean of Linked Geospatial Data. ESWC Demo paper

Figure 4.1: Facet-graph-based editor (left) and text-based editor (right)

Figure 4.2: Visual Query Builder system overview

Figure 4.3: Technical framework for the Query Builder

Figure 4.4: Overall view on the functionalities of the Query Builder

Figure 4.5: Screenshot of the Query Builder UI showing a Facet-Graph based query representation for the DLR Use Case

Figure 4.6: Screenshot of the Query Builder UI showing the stSPARQL equivalent of the Facet-Graph based query representation for the DLR Use Case in the text based Query Editor

Geographica: A Benchmark for Geospatial RDF Stores (Long Version)

George Garbis, Kostis Kyzirakos, and Manolis Koubarakis

National and Kapodistrian University of Athens, Greece
{ggarbis,kk,koubarak}@di.uoa.gr