The Semantic Web of Linked Data is continuously growing and changing over time. While in some cases only the latest version of a dataset is required, there is a growing need for access to prior dataset versions for data analysis, for example to analyze the evolution of taxonomies, or to track the evolution of diseases in biomedical datasets.
Several approaches have already been proposed to store and query versioned Linked Datasets. However, surveys [4, 5] have shown that the versioning capabilities of current systems need improvement. Existing solutions either evaluate versioned queries quickly or require little storage space, but not both. Furthermore, no existing solution performs well for all versioned query types, namely querying at, between, and for different versions.
In recent work, we introduced a compressed RDF archive indexing technique, implemented under the name of OSTRICH, that enables highly efficient triple pattern-based versioned querying capabilities. It offers a new trade-off compared to other approaches, as it calculates and stores additional information at ingestion time in order to reduce query evaluation time. This additional information includes pointers to relevant positions to improve the efficiency of result offsets. Furthermore, it supports efficient result cardinality estimation, streaming results and offset support to enable efficient usage within query engines.
The Mighty Storage Challenge (MOCHA) is a yearly challenge that aims to measure and detect bottlenecks in RDF triple stores. One of the tasks in this challenge concerns the storage and querying of versioned datasets. This task uses the SPBv benchmark, which consists of a dataset and SPARQL query workload generator for different versioned query types. All MOCHA tasks are evaluated on the HOBBIT benchmarking platform. As SPBv evaluates SPARQL queries, we combine OSTRICH, a versioned triple index with a triple pattern interface, with Comunica, a modular SPARQL engine platform.
The remainder of this paper is structured as follows. First, the next section briefly introduces the OSTRICH store and the Comunica SPARQL engine. After that, we present our preliminary results in Section 3. Finally, we conclude and discuss future work in Section 4.
Versioned Query Engine
In this section we introduce the versioned query engine that consists of the OSTRICH store and the Comunica framework. We discuss these two parts separately in the following sections.
OSTRICH is the implementation of a compressed RDF archive indexing technique  that offers efficient triple pattern queries in, between, and over different versions. In order to achieve efficient querying for these different query types, OSTRICH uses a hybrid storage technique that is a combination of individual copies, change-based and timestamp-based storage. The initial dataset version is stored as a fully materialized and immutable snapshot. This snapshot is stored using HDT , which is a highly compressed, binary RDF representation. All other versions are deltas, i.e., lists of triples that need to be removed and lists of triples that need to be added. These deltas are relative to the initial version, but merged in a timestamp-based manner to reduce redundancies between each version. In order to offer optimization opportunities to query engines that use this store, OSTRICH offers efficient cardinality estimation, streaming results and efficient offset support.
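To make this hybrid approach concrete, the following TypeScript sketch materializes a version by applying a single initial-version-relative delta to the snapshot. This is our own simplification under stated assumptions: `HybridStore`, its methods, and the string-based triples are illustrative and not OSTRICH's actual API, which additionally compresses and indexes the data.

```typescript
// Minimal sketch of OSTRICH's hybrid storage idea (not its actual API):
// version 0 is an immutable snapshot, all later versions are deltas
// relative to that initial version.
type Triple = string; // a serialized "s p o" triple, for illustration only

interface Delta {
  additions: Set<Triple>; // triples added since version 0
  deletions: Set<Triple>; // triples removed since version 0
}

class HybridStore {
  constructor(
    private snapshot: Set<Triple>,          // fully materialized version 0
    private deltas: Map<number, Delta>,     // one delta per later version
  ) {}

  // Version Materialization (VM): apply the single delta for `version`
  // to the snapshot; no chain of patches has to be replayed.
  materialize(version: number): Set<Triple> {
    if (version === 0) return new Set(this.snapshot);
    const delta = this.deltas.get(version);
    if (delta === undefined) throw new Error(`unknown version ${version}`);
    const result = new Set(this.snapshot);
    for (const t of delta.deletions) result.delete(t);
    for (const t of delta.additions) result.add(t);
    return result;
  }
}
```

Because every delta is relative to the initial snapshot, materializing any version touches at most one delta, at the cost of some redundancy between stored deltas.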
Comunica is a highly modular Web-based SPARQL query engine platform. Its modularity enables federated querying over heterogeneous interfaces, such as SPARQL endpoints, Triple Pattern Fragments (TPF) interfaces and plain RDF files. New types of interfaces and data sources can be supported by implementing an additional software component and plugging it into a publish-subscribe-based system through an external semantic configuration file.
In order to support versioned SPARQL querying over an OSTRICH backend, we implemented a module for resolving triple patterns with a versioning context against an OSTRICH dataset. Furthermore, as versions within the SPBv benchmark are represented as named graphs, a separate module rewrites such queries into OSTRICH-compatible queries in, between, or over different versions as a pre-processing step. Finally, we provide a default Comunica configuration and script to use these modules together with the existing Comunica modules as a SPARQL engine. These three modules are explained in more detail hereafter.
OSTRICH enables versioned triple pattern queries at, between, and for different versions. These query types are respectively known as Version Materialization (VM), Delta Materialization (DM) and Version Querying (VQ). In the context of the SPBv benchmark, only the first two query types (VM and DM) are evaluated, which is why only support for these two is implemented in the OSTRICH module at the time of writing.
The OSTRICH Comunica module consists of an actor that enables VM and DM triple pattern queries against a given OSTRICH store, and is registered to Comunica's rdf-resolve-quad-pattern bus. This actor receives messages consisting of a triple pattern and a context. It expects the context to contain either VM or DM information, together with a reference to an OSTRICH store. For VM queries, a version identifier must be provided in the context; for DM queries, a start and end version identifier are expected. The rdf-resolve-quad-pattern bus expects two types of output:
- A stream of matching triples.
- An estimated count of the number of matching triples.
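The actor's contract can be sketched as follows. All names here (`VersionContext`, `resolveQuadPattern`, the mock store methods) are our own illustrations and do not correspond to Comunica's or OSTRICH's actual identifiers, and the triple stream is modeled as a plain array for brevity.

```typescript
// Hypothetical sketch of the actor's input/output contract.
type Triple = { s: string; p: string; o: string };

// The versioning context attached to a triple pattern: either a single
// version (VM) or a version range (DM).
type VersionContext =
  | { type: 'version-materialization'; version: number }
  | { type: 'delta-materialization'; versionStart: number; versionEnd: number };

// Stand-in for the OSTRICH store interface; method names are illustrative.
interface MockOstrichStore {
  searchVersionMaterialized(pattern: Partial<Triple>, version: number): Triple[];
  searchDeltaMaterialized(pattern: Partial<Triple>, start: number, end: number): Triple[];
}

// Resolve a triple pattern against the store, producing both outputs the
// bus expects: the matching triples and an estimated count.
function resolveQuadPattern(
  store: MockOstrichStore,
  pattern: Partial<Triple>,
  context: VersionContext,
): { triples: Triple[]; estimatedCount: number } {
  const triples =
    context.type === 'version-materialization'
      ? store.searchVersionMaterialized(pattern, context.version)
      : store.searchDeltaMaterialized(pattern, context.versionStart, context.versionEnd);
  return { triples, estimatedCount: triples.length };
}
```

In the real module, the count is an estimate provided by OSTRICH's cardinality estimation rather than the length of a fully materialized result.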
Versioned Query Rewriter Module
The SPBv benchmark represents versions as named graphs. Listing 1 and Listing 2 respectively show examples of VM and DM queries in this representation. Our second module is responsible for rewriting such named-graph-based queries into context-based queries that the OSTRICH module can accept.
In order to transform VM named-graph-based queries, we detect GRAPH clauses and consider their graph URIs to be identifiers for the VM version. Our rewriter unwraps the pattern(s) inside such a GRAPH clause, and attaches a VM version context with the detected version identifier.

For transforming DM named-graph-based queries, we detect pairs of GRAPH clauses for the same pattern in the same scope. The rewriter unwraps the equal pattern(s), and constructs a DM version context with a starting and ending version identifier. The starting version is always the smallest of the two graph URIs, and the ending version is the largest, assuming lexicographical sorting. If the graph URI from the first pattern is larger than the second graph URI, then the DM query only considers additions; in the other case, only deletions will be queried.
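This ordering rule can be sketched as a small helper; the function name and the `DmContext` shape are ours, not identifiers from the rewriter module, and we assume version graph URIs that sort lexicographically.

```typescript
// Sketch of the DM ordering rule: the two graph URIs determine the version
// range, and their order in the query determines additions vs. deletions.
interface DmContext {
  versionStart: string;
  versionEnd: string;
  queryAdditions: boolean; // true: only additions; false: only deletions
}

function dmContextFromGraphs(firstGraph: string, secondGraph: string): DmContext {
  // The starting version is the lexicographically smallest graph URI,
  // the ending version the largest.
  const [versionStart, versionEnd] =
    firstGraph < secondGraph ? [firstGraph, secondGraph] : [secondGraph, firstGraph];
  // If the first pattern's graph URI is the larger one, only additions
  // are queried; otherwise only deletions.
  return { versionStart, versionEnd, queryAdditions: firstGraph > secondGraph };
}
```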
The Comunica platform allows SPARQL engines to be created based on a semantic configuration file. By default, Comunica has a large collection of modules to create a default SPARQL engine. For this work, we adapted the default configuration file by adding our OSTRICH and rewriter modules. This allows complete versioned SPARQL querying, instead of only the versioned triple pattern querying that OSTRICH supports. This engine is available on the npm package manager for direct usage.
Preliminary Results

In this section, we present the results of running the SPBv benchmark on Comunica and OSTRICH.
As the MOCHA challenge requires running a system within the Docker-based HOBBIT platform, we provide a system adapter with a Docker container for our engine that is based on Comunica and OSTRICH. Using this adapter, we ran the SPBv benchmark on our system on the HOBBIT platform with the parameters from Table 1.
| Parameter | Value |
|---|---|
| Triples in version 1 | 100,000 |
| Version deletion ratio | 10% |
| Version addition ratio | 15% |
Table 1: Configuration of the SPBv benchmark for our experiment.
For the used configuration, our system is able to ingest 29,719 triples per second for the initial version, and 5,858 triples per second for the following changesets. The complete dataset requires 17MB to be stored using our system. The initial version ingestion is significantly faster because the initial version is stored directly as an HDT snapshot. For each following changeset, OSTRICH requires more processing time as it calculates and stores additional metadata and converts the changeset to one that is relative to the initial version instead of the preceding version.
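This conversion step can be illustrated with a minimal sketch. It is our own simplification under set-based semantics, not OSTRICH's actual code: it rebases a changeset that is relative to the preceding version onto the initial version by composing it with the preceding version's initial-relative changeset.

```typescript
// Sketch of rebasing a changeset onto the initial version.
interface Changeset {
  additions: Set<string>; // serialized triples added
  deletions: Set<string>; // serialized triples deleted
}

// `prev`: changeset of version k relative to version 0.
// `next`: changeset of version k+1 relative to version k.
// Returns the changeset of version k+1 relative to version 0.
function rebaseToInitial(prev: Changeset, next: Changeset): Changeset {
  const additions = new Set(prev.additions);
  const deletions = new Set(prev.deletions);
  for (const t of next.deletions) {
    // Deleting a triple that was added earlier cancels out; otherwise it is
    // a deletion with respect to the initial version.
    if (additions.has(t)) additions.delete(t);
    else deletions.add(t);
  }
  for (const t of next.additions) {
    // Re-adding a triple that was deleted earlier cancels out; otherwise it
    // is an addition with respect to the initial version.
    if (deletions.has(t)) deletions.delete(t);
    else additions.add(t);
  }
  return { additions, deletions };
}
```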
For the 99 queries that were evaluated, our system failed for 27 of them according to the benchmark. The majority of failures are caused by incomplete SPARQL expression support in Comunica, which is not on par with SPARQL 1.1 at the time of writing. The other failures (in task 5.1) are caused by an error in the benchmark where changes in literal datatypes are not being detected. We are in contact with the benchmark developer to resolve this.
For the successful queries, our system achieves fast query evaluation times for all query types, as shown in Table 2. In summary, queries of type 1 (queries starting with a 1-prefix) completely materialize the latest version, type 2 queries within the latest version, type 3 retrieves a full past version, type 4 queries within a past version, type 5 queries the differences between two versions, and type 8 queries over two different versions. Additional details on the query types can be found in the SPBv article.
Table 2: Evaluation times in milliseconds and the number of results for all SPBv queries that were evaluated successfully.
Conclusions

This article represents an entry for the versioning task in the Mighty Storage Challenge 2018 as part of the ESWC 2018 Challenges Track. Our work consists of a versioned query engine that combines the OSTRICH versioned triple store with the Comunica SPARQL engine platform. Preliminary results show fast query evaluation times for the queries that are supported. The list of unsupported queries is being used as a guideline for the further development of OSTRICH and Comunica.
During the usage of the SPBv benchmark, we identified several KPIs that are explicitly supported by OSTRICH, but were not being evaluated at the time of writing. We list them here as a suggestion to the benchmark authors for future work:
- Measuring storage size after each version ingestion.
- Reporting of the ingestion time of each version separately, in addition to the current average.
- Evaluation of querying all versions at the same time, and retrieving their applicable versions.
- Evaluation of stream-based query results and offsets, for example using a diefficiency metric.
In future work, we intend to evaluate our system using different configurations of the SPBv benchmark, such as increasing the number of versions and increasing the change ratios. Furthermore, we intend to compare our system with other similar engines, both at triple index-level, and at SPARQL-level.