Man vs. Machine: Differences in SPARQL Queries

Laurens Rietveld (1) and Rinke Hoekstra (1, 2)

(1) Department of Computer Science, VU University Amsterdam, The Netherlands, {laurens.rietveld,rinke.hoekstra}@vu.nl
(2) Leibniz Center for Law, University of Amsterdam, The Netherlands, hoekstra@uva.nl

Abstract. Server-side SPARQL query logs have been a topic of study for some time now. The USEWOD collection of query logs is currently the primary source of information for researchers. A recurring problem is that these logs leave application queries and queries created by humans indistinguishable. In this paper, we present a new collection of manually created queries that were collected through the YASGUI SPARQL interface. We show how these queries differ from those taken from server logs.

Keywords: SPARQL, endpoints, Semantic Web, Linked Data

1 Introduction

SPARQL queries are the standard for querying the Linked Open Data (LOD) cloud. They are an interesting subject of study: analyzing SPARQL query logs can help optimize triple-stores, understand and improve user interaction, improve the way we rank the query results, and much more. Unfortunately, these query logs are hard to come by, and the only (semi) public source of query logs is collected and made available by the USEWOD workshop [3]. These query logs constitute the primary source of information about how SPARQL endpoints are used. However, because these are server-side logs, it is nearly impossible to reliably distinguish between queries directly created by the (human) user and queries coming from applications. An example of the latter are queries executed by the Pubby interface [4]. Browsing an endpoint via this interface automatically executes several SPARQL queries in the background (all having exactly the same structure, though with different resources). The need to distinguish between machines and humans is recognized by others as well [9,10,13], and is important for research involving users.
This paper provides more context to the server-side logs by showing what manually created queries look like, and how they structurally differ from server log queries. We make this comparison using client-side query logs from the YASGUI SPARQL editor (see http://yasgui.org), presented in [14]. YASGUI (Yet Another SPARQL Graphical User Interface) is a feature-packed SPARQL query editor, capable of accessing any SPARQL endpoint. In Section 2 we present the related work. Then, in Section 3 we discuss how we collect and analyze the queries, followed by Section 4, where we present our results. We conclude in Section 5.

This work was supported by the Dutch national program COMMIT/.

2 Related Work

Research related to SPARQL query templates and user sessions [9] showed a large number of similarly structured SPARQL queries in the dataset. The authors argue that these queries are most likely issued by machine agents. Only a small percentage of relatively short query sessions show heterogeneous structure, possibly indicating human users. Similar work by [13] extracts user sessions from the logs, where the authors try to differentiate between human and machine 'user' sessions. They observe that a very small set of users contributes to a large percentage of the queries, and hypothesize that these users are machine agents.

Although attempts to distinguish between humans and machines provide insightful information, it remains impossible to distinguish between both types of queries with a known degree of certainty. The burden of distinguishing between both types of queries can be alleviated by the availability of a man-made collection of queries. Such a query collection is particularly interesting for Linked Data research with a strong user aspect, such as detecting user browsing patterns [8], user modeling [5] and query modification assistants for users [7].

3 Approach

We collect the man-made queries via the YASGUI interface, which is packed with usability features such as auto-completion, syntax highlighting, dataset endpoint search, and sharing functionalities. When given permission to do so, YASGUI tracks the actions of users using Google Analytics (see http://www.google.com/analytics/), including the endpoint accessed by the user, the specific query, and the time it takes to execute the query (tracking of query execution time was added recently and is not included in this paper).

Since the public launch of YASGUI in early 2013, it has been quite successful in attracting visitors from across the world (see Figure 1). Based on our logs we observe that YASGUI received 2,611 unique visitors, and that 39,201 queries were executed against 645 endpoints. Because 40% of the users do not allow collecting these statistics, the figures presented in this paper only cover a subset of the user base.

[Figure 1: Location of YASGUI users. Pie chart with segments for the Netherlands, United States, France, United Kingdom, Germany, Italy, Japan and Other.]

We compare YASGUI queries to the server logs from the USEWOD 2014 challenge, which contain logs from DBpedia [1], Linked Geo Data [2] and BioPortal [11]. Because DBpedia is the most popular endpoint in the YASGUI logs (around 15,000 queries), and because the number of Linked Geo Data and BioPortal queries we could collect is relatively low (100 and 5 queries, respectively), we only compare YASGUI and USEWOD queries to DBpedia.

Ideally, we would avoid a bias by removing duplicate queries on a per-user basis, as users tend to send the same query multiple times. However, because Google Analytics mostly presents aggregate information, and no detailed visitor information (e.g. user-agent or host name), we cannot remove duplicates (per user) from our logs. Therefore, for the sake of comparison, all the statistics presented in this paper are computed on the complete set of queries in our query sets. The only requirement we impose is that all queries should be syntactically correct.

Our analysis of the YASGUI and USEWOD logs consists of extracting structural properties from the queries, and is largely based on [6], and to a lesser extent on [12]. We extract these properties using the Jena query parser (see http://jena.apache.org/). The properties we extract for each query are:

1. Query type, e.g. SELECT or CONSTRUCT.
2. The use of SPARQL features such as DISTINCT, UNION, ORDER BY and LIMIT.
3. The types of triple patterns, i.e. an RDF triple in which each item can be a variable. For each item in the triple pattern, we analyze whether it is a variable (V), a constant (C, either a URI, literal or blank node), or a property path (*). We would represent a triple pattern such as [] rdfs:label ?label as C C V.

4. The number and types of joins. We use the same approach as [6] in counting the types of joins. There are 6 possible types of joins, depending on the position of the common variable between two triple patterns: Subject-Subject, Predicate-Predicate, Object-Object, Subject-Predicate, Subject-Object and Predicate-Object.

Table 1: Use of SPARQL grammar

                                YASGUI    USEWOD
#syntactically valid queries    13,242    100,763
% unique queries                65.73%    69.67%
Type
  SELECT                        93.91%    96.17%
  DESCRIBE                       0.72%     3.00%
  CONSTRUCT                      1.49%     0.56%
  ASK                            3.87%     0.26%
SPARQL features
  ORDER BY                      18.64%     6.37%
  DISTINCT                      15.07%    24.37%
  LIMIT                         42.32%    12.13%
  OFFSET                         0.14%     0.75%
  FILTER                        30.35%    15.11%
  Subquery                       2.82%     0.37%
  SERVICE                        0.91%     0.08%
  ≥1 UNION in query             24.04%     4.46%
  ≥1 OPTIONAL in query           1.71%     3.14%

4 Results

Table 1 shows the range of SPARQL features used by both query sets. In general, the query types roughly correspond, whereas the use of SPARQL features and solution modifiers differs greatly. The LIMIT solution modifier occurs in roughly 42% of the YASGUI queries, and only in 12% of the USEWOD queries. Additionally, the FILTER function shows a striking difference (roughly 30% vs 15%), as does the number of queries containing UNION clauses (roughly 24% vs 5%). We observe that most of the SPARQL features are used more often in YASGUI queries than in USEWOD queries, which might indicate that human users use a wider range of SPARQL features than applications do (assuming the majority of USEWOD queries originate from applications [9,13]).

We can observe structural differences between the queries in both sets as well. In Figure 2, we show the number of triple patterns per query (note the logarithmic scale). Only 23% of the USEWOD queries contain more than a single triple pattern, compared to 68% of YASGUI queries. As Table 2 shows, the type of triple patterns differs greatly as well. Patterns of the form V C C or V C V occur roughly twice as often in the YASGUI logs as in the USEWOD logs. However, the pattern C C V is the most dominant one in the USEWOD logs (42.71%), whereas it accounts for only 7.10% of the YASGUI triple patterns.

[Figure 2: Number of triple patterns per query (log). x-axis: number of triple patterns (1 to 15); y-axis: percentage of queries (log scale).]

Table 2: Triple pattern types (C=constant, V=variable, *=property path)

Pattern    YASGUI    USEWOD
V C C      45.91%    19.92%
V C V      36.22%    19.06%
C C V       7.10%    42.71%
C V V       2.95%     6.10%
V V C       2.88%     3.92%
V V V       1.61%     1.88%
C C C       0.28%     0.17%
C V C       0.12%     0.06%
V * C       2.59%     6.01%
V * V       0.30%     0.10%
C * V       0.05%     0.08%
C * C       0.00%     0.00%

[Figure 3: Number of joins in queries. x-axis: #joins (0 to 10); y-axis: percentage of queries.]

[Figure 4: Proportion of all join types. x-axis: join type (SS, SO, OO, PP, SP, PO); y-axis: percentage of joins.]

Finally, Figure 3 shows that the number of joins differs greatly between both query sets. This is in large part due to the large number of USEWOD queries containing only a single triple pattern (i.e. by definition no join). Figure 4 shows the proportions of join types for both query sets, which differ little between the two.
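The triple pattern types and join types compared above can be illustrated with a small sketch. The Python below is not the tooling used for this paper (the properties were extracted with the Jena query parser). It assumes the triple patterns of a query have already been parsed into (subject, predicate, object) tuples of strings, that variables start with ? or $, and that property paths are marked with a hypothetical Path wrapper introduced here for illustration. Counting joins over every pair of patterns that share a variable is one plausible reading of the approach of [6], not a reproduction of it.

```python
from collections import Counter
from itertools import combinations
from typing import NamedTuple


class Path(NamedTuple):
    """Hypothetical marker for a predicate that is a property path."""
    expr: str


POSITIONS = ("S", "P", "O")  # subject, predicate, object


def kind(term):
    """Classify one item of a triple pattern as V, C or *."""
    if isinstance(term, Path):
        return "*"
    if term.startswith(("?", "$")):
        return "V"
    # URIs, literals and blank nodes all count as constants.
    return "C"


def signature(pattern):
    """Return the triple pattern type, e.g. 'C C V'."""
    return " ".join(kind(term) for term in pattern)


def join_types(patterns):
    """Count join types over every pair of patterns sharing a variable."""
    counts = Counter()
    for first, second in combinations(patterns, 2):
        for i, a in enumerate(first):
            if kind(a) != "V":
                continue
            for j, b in enumerate(second):
                if a == b:
                    # Normalise the label so O-S and S-O both become SO.
                    pair = sorted((POSITIONS[i], POSITIONS[j]),
                                  key=POSITIONS.index)
                    counts["".join(pair)] += 1
    return counts


# Assumed example query body; the resource names are illustrative only.
example = [
    ("?s", "rdf:type", "dbo:City"),
    ("?s", "dbo:country", "?c"),
    ("?c", "rdfs:label", "?label"),
]

print([signature(p) for p in example])  # ['V C C', 'V C V', 'V C V']
print(dict(join_types(example)))        # {'SS': 1, 'SO': 1}
```

Aggregating these signatures and join labels over all queries in a log would give distributions of the kind reported in Table 2 and Figure 4.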

5 Discussion

This paper shows that user queries obtained at the client side are very different from queries logged by SPARQL servers: the server queries are smaller, have fewer joins, and use fewer features of the SPARQL grammar. This indicates that man-made queries are more complex than routine queries fired by applications. We can furthermore conclude that both types of queries are structurally very different: the types of triple patterns observable in queries deviate greatly between both query sets. We believe that our results further corroborate the work of [9,13], and that these differences are largely caused by the many application queries in the USEWOD logs.

This insight into the structural differences between query sets is important for research related to users on the Web of Data. For such research, the YASGUI query logs can grow to be an interesting, more reliable source of data, as they only contain man-made queries. A second advantage is that the range of endpoints for which YASGUI collects query logs is larger than that of any query collection currently available, allowing for a broader analysis of how the Web of Data is used. With time, following an increase in uptake of YASGUI, we can paint an even clearer picture of how users use SPARQL and the Web of Data.

References

1. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: A nucleus for a web of open data. In: The Semantic Web, pp. 722-735. Springer (2007)
2. Auer, S., Lehmann, J., Hellmann, S.: LinkedGeoData: Adding a spatial dimension to the web of data. In: The Semantic Web - ISWC 2009, pp. 731-746. Springer (2009)
3. Berendt, B., Hollink, L., Luczak-Rösch, M., Möller, K., Vallet, D.: Proceedings of USEWOD 2014 - 4th International Workshop on Usage Analysis and the Web of Data. In: 11th ESWC (2014)
4. Cyganiak, R., Bizer, C.: Pubby - a linked data frontend for SPARQL endpoints. Retrieved from http://www4.wiwiss.fu-berlin.de/pubby/ at May 28, 2011 (2008)
5. Fortuna, B., Mladenic, D., Grobelnik, M.: User modeling combining access logs, page content and semantics. arXiv preprint arXiv:1103.5002 (2011)
6. Gallego, M.A., Fernández, J.D., Martinez-Prieto, M.A., de la Fuente, P.: An empirical study of real-world SPARQL queries. In: USEWOD Workshop. pp. 3-6 (2011), http://arxiv.org/abs/1103.5043
7. Hollink, V., de Vries, A.: Towards an automated query modification assistant. arXiv preprint arXiv:1104.0128 (2011)
8. Hoxha, J., Junghans, M., Agarwal, S.: Enabling semantic analysis of user browsing patterns in the web of data. arXiv preprint arXiv:1204.2713 (2012)
9. Lorey, J., Naumann, F.: Detecting SPARQL query templates for data prefetching. In: The Semantic Web: Semantics and Big Data, pp. 124-139. Springer (2013)
10. Möller, K., Hausenblas, M., Cyganiak, R., Handschuh, S.: Learning from linked open data usage: Patterns & metrics (2010)
11. Noy, N.F., Shah, N.H., Whetzel, P.L., Dai, B., Dorf, M., Griffith, N., Jonquet, C., Rubin, D.L., Storey, M.A., Chute, C.G., et al.: BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research 37(suppl 2), W170-W173 (2009)
12. Picalausa, F., Vansummeren, S.: What are real SPARQL queries like? In: Proceedings of the International Workshop on Semantic Web Information Management. p. 7. ACM (2011)
13. Raghuveer, A.: Characterizing machine agent behavior through SPARQL query mining. In: Proceedings of the International Workshop on Usage Analysis and the Web of Data, Lyon, France (2012)
14. Rietveld, L., Hoekstra, R.: YASGUI: Not just another SPARQL client. In: The Semantic Web: ESWC 2013 Satellite Events, pp. 78-86. Springer (2013)