The Fundamentals of Semantic Versioned Querying

Abstract

The domain of RDF versioning concerns itself with the storage of different versions of Linked Datasets. The ability of querying over these versions is an active area of research, and allows for basic insights to be discovered, such as tracking the evolution of certain things in datasets. Querying can however only get you so far. In order to derive logical consequences from existing knowledge, we need to be able to reason over this data, such as ontology-based inferencing. In order to achieve this, we explore fundamental concepts on semantic querying of versioned datasets using ontological knowledge. In this work, we present these concepts as a semantic extension of the existing RDF versioning concepts that focus on syntactical versioning. We remain general and assume that versions do not necessarily follow a purely linear temporal relation. This work lays a foundation for reasoning over RDF versions from a querying perspective, using which RDF versioning storage, query and reasoning systems can be designed.

Introduction

RDF versioning has been an active area of research that looks into storage and querying techniques for different versions of Linked Datasets. These versions do not necessarily follow a purely linear temporal relation, as multiple different versions or branches of versions could exist at the same time, as opposed to RDF streams [1].

One of the key benefits of RDF is its ability to encode semantics into the data through the use of ontologies. This allows reasoners to interpret this data, and use this encoded meaning to transform and infer knowledge. Reasoning within static RDF data and temporal RDF streams are already well-established research domains. Reasoning over RDF versions is however still an underexplored domain. Some preliminary work has already been done in the domain of reasoning over RDF versions. In one work for example, the calculation of semantic differences [2] between versions has been investigated. In another work, reasoning with multi-version ontologies [3] was investigated. These works only cover very specific parts of the reasoning demands within RDF versioning.

For instance, given a versioned dataset, it is currently impossible with the existing systems to efficiently find all versions in which a certain fact can be inferred. Furthermore, the domain of combining reasoning and querying —as is done in Ontology-Based Data Access techniques [4, 5]— within the domain of RDF versioning remains unexplored.

In this work, we introduce a general formalization of semantic versioned querying, i.e., reasoning within RDF versioning from the querying perspective, For this, we extend the five foundational versioned query types [6] that were introduced to cover the retrieval demands within RDF versioning. We formalize concepts such as reasoning within a single version, version differences, and different versions. Furthermore, we present a prototypical implementation of a versioned RDF store that offers basic rule-based reasoning capabilities at query-time. This prototype demonstrates the benefits of semantic versioning, such as finding all versions in which a certain fact can be inferred, and storage space reduction by inferring facts instead of materializing them beforehand.

The aim of this work is to provide a foundation for the future research and development of semantic versioned querying within RDF stores. This will lead to improvements inside domains that require the semantic analysis on Linked Datasets, for example for analyzing concept drift [7] or tracking diseases in biomedical datasets [8] over time.

This article is structured as follows: In the next section, we discuss related work on semantic versioned querying. In Section 3, we discuss the fundamental concepts on RDF versioning. After that, in Section 4, we introduce new foundational semantic versioned query atoms. In Section 5, we present a proof-of-concept implementation of these atoms with a preliminary evalution. Finally, we conclude in Section 6.

Fundamentals

In this section, we introduce the fundamental concepts on RDF archiving and the semantic diff.

RDF Archiving

An RDF archive [6] has been defined by Fernández et al. as follows:

An RDF archive graph is a set of version-annotated triples, where a version-annotated triple (s, p, o):[i] is an RDF triple (s, p, o) with a label i representing the version in which this triple holds. The set of all RDF triples [22] is defined as (U ∪ B) × U × (U ∪ B ∪ L), where U, B, and L, respectively represent the disjoint, infinite sets of URIs, blank nodes, and literals. Finally, an RDF version of an RDF archive A at snapshot i is the RDF graph A(i) = {(s, p, o)|(s, p, o):[i] ∈ A}.

For the remainder of this article, we use the shorthand notation Ai to refer to the RDF version A(i).

To cover the retrieval demands in RDF archiving—also known as RDF versioning—, five foundational query types were introduced [6], which are referred to as query atoms. These query atoms are based on the RDF data model [22] and SPARQL query language [23]. In these models, a triple pattern is defined as (U ∪ V) × (U ∪ V) × (U ∪ L ∪ V), with V being the infinite set of variables. A set of triple patterns is called a Basic Graph Pattern, which forms the basis of a SPARQL query. The evaluation of a SPARQL query Q on an RDF graph G containing RDF triples, produces a bag of solution mappings [[Q]]G.

The five foundational query atoms introduced by Fernández et al. are the following:

  1. Version materialization (VM) retrieves data using a query Q targeted at a single version Ai.
    Formally: VM(Q, Ai) = [[Q]]Ai.
    Example: Which books were present in the library yesterday?
  2. Delta materialization (DM) retrieves query Q’s result change sets between two versions Ai and Aj.
    Formally: DM(Q, Ai, Aj)=(Ω+, Ω). With Ω+ = [[Q]]Ai \ [[Q]]Aj and Ω = [[Q]]Aj \ [[Q]]Ai.
    Example: Which books were returned or taken from the library between yesterday and now?
  3. Version query (VQ) annotates query Q’s results with the versions (of RDF archive A) in which they are valid.
    Formally: VQ(Q, A) = {(Ω, W) | W = {A(i) | Ω=[[Q]]A(i), i ∈ N} ∧ Ω ≠ ∅}.
    Example: At what times was book X present in the library?
  4. Cross-version join (CV) joins the results of two queries (Q1 and Q2) between versions Ai and Aj.
    Formally: VM(Q1, Ai) ⨝ VM(Q2, Aj).
    Example: What books were present in the library yesterday and today?
  5. Change materialization (CM) returns a list of versions in which a given query Q produces consecutively different results.
    Formally: {(i, j) | i < j, DM(Q, A(i), A(j)) = (Ω+, Ω), Ω+ ∪ Ω ≠ ∅, ∄ k ∈ ℕ : i < k < j}.
    Example: At what times was book X returned or taken from the library?

Semantic Diff

Völkel et al. introduce the concept of a semantic diff [2] that takes the semantics of an ontology language into account when calculating the diff, which is not the case for a regular structural diff, which calculates which triples have been added and which ones have been removed. As an example, consider the dataset with two versions from Listing 1. The typical, structural diff just takes the difference between these two versions at triple level, without taking into account the meaning of the triples. The structural diff of this example can be found in Listing 2. A semantic diff on the other hand, takes into account the meaning of the data. For this example, we know that cat is a subclass of animal. Therefore, the removal of Bob being an animal does not actually take place, because it can still be inferred in version 1, as shown in Listing 3.

Version 0:
    ex:Bob a ex:Animal.
    ex:Bob foaf:name "Bob".

Version 1:
    ex:Bob a ex:Cat.
    ex:Bob foaf:name "Bob".

Language:
    ex:Cat rdf:subClassOf ex:Animal.

Listing 1: A simple example dataset with two versions about Bob the cat, with a separate ontology language.

Removed:
    ex:Bob a ex:Animal.

Added:
    ex:Bob a ex:Cat.

Listing 2: The structural diff between the two versions in Listing 1.

Removed:
Added:
    ex:Bob a ex:Cat.

Listing 3: The semantic diff between the two versions in Listing 1.

The semantic closure sl(A) of a set of RDF triples A is the set of all statements that can be concluded from the statements in A under the semantics of the RDF-based ontology language l. A semantic diff dl(A,B) of two sets of RDF triples (A and B) is formally defined by Völkel et al. as dl(A,B) = (+l(A,B),−l(A,B)), with +l(A, B) = sl(B) \ (sl(A) ∩ sl(B)) and l(A, B) = sl(A) \ (sl(A) ∩ sl(B)).

Semantic Versioned Query Atoms

As discussed in Section 3, there exist five foundational query atoms for querying RDF archives. In this section, we introduce semantic extensions of these query atoms, similar to the structural diff to semantic diff extension introduced by Völkel et al. More concretely, we will extend these five query atoms with parameters for language versioning, instead of only dataset versioning.

Versioned Semantic Closure

To remain in line with the definitions on RDF archiving as listed in Section 3, we extend the semantic closure definition by Völkel et al. as follows: The versioned semantic closure s(Ai, Lj) of a version Vi is the set of all triples that can be inferred from the triples in Ai under the semantics of the RDF-based ontology language Lj. In this definition, we consider the RDF-based ontology language l to represent an RDF archive as well, for which we use the notation Lj to refer to the RDF version j of the archive L.

Query Atoms

In this section, we describe the semantic extensions of the five foundational query atoms for querying RDF archives. The atoms that apply to multiple versions (VQ and CM) are subcategorized to handle the different combinations of single and cross-version dataset and ontology versions.

Each extension is described formally, and an example of its usage is given. All examples apply to the use case of a cat shelter that makes use of an evolving ontology of cat species.

The semantic extension of the five versioned query atoms are defined as follows:

  1. Semantic version materialization (S-VM) retrieves data using a query Q targeted at a single version Ai in a single ontology version Lj.
    Formally: S-VM(Q, Ai, Lj) = [[Q]]s(Ai, Lj)
    Example: Which African wild cats were present in the shelter yesterday according to last year’s classification?
  2. Semantic delta materialization (S-DM) retrieves query Q’s result change sets between two versions Ai and Aj respectively using ontology version Lk and Ll.
    Formally: S-DM(Q, Ai, Aj, Lk, Ll) = (Ω+, Ω). With Ω+ = [[Q]]s(Ai, Lk) \ [[Q]]s(Aj, Ll) and Ω = [[Q]]s(Aj, Ljl) \ [[Q]]s(Ai, Lk)
    Example: Which cats that were present in the shelter since ten years ago became a different species over the last year?
  3. Semantic version query (S-VQ)
    1. Semantic intermodal and interontological version query (MOS-VQ) annotates query Q’s results with the versions of RDF archive A and ontology L in which they are valid.
      Formally: S-VQ(Q, A, L) = {(Ω, V) | V = {(Ai, Lj) | Ω=[[Q]]s(Ai, Lj), i, j ∈ N} ∧ Ω ≠ ∅}
      Example: At what points in time were there African wild cats in the schelter, and according to the classification of what time?
    2. Semantic intermodal version query (MS-VQ) annotates query Q’s results with the versions of RDF archive A in which they are valid according to ontology version Lj.
      Formally: S-VQ(Q, A, Lj) = {(Ω, V) | V = {Ai | Ω=[[Q]]s(Ai, Lj), j ∈ N} ∧ Ω ≠ ∅}
      Example: At what points in time were there African wild cats in the schelter?
    3. Semantic interontological version query (OS-VQ) annotates query Q’s results with the versions of ontology L in which they are valid within dataset version Ai.
      Formally: S-VQ(Q, Ai, L) = {(Ω, V) | V = {Lj | Ω=[[Q]]s(Ai, Lj), i ∈ N} ∧ Ω ≠ ∅}
      Example: At what points in time were the cats that are currently in the shelter ever African wild cats?
  4. Semantic cross-version join (S-CV) joins the results of two queries (Q1 and Q2) between versions Ai and Aj respectively using ontology version Lk and Ll.
    Formally: S-CV(Q1, Q2, Ai, Aj, Lk, Ll) = S-VM(Q1, Ai, Lk) ⨝ S-VM(Q2, Aj, Ll)
    Example: Which African wild cats were in the shelter yesterday (according to last year’s classification) and the day before (according to the current classification)?
  5. Semantic change materialization (S-CM)
    1. Semantic intermodal and interontological change materialization (MOS-CM) returns a list of consecutive archive and ontology versions in which a given query Q produces different results.
      Formally: S-CV(Q, A, L) = {(i, j, k, l) | i < j, k < l, S-DM(Q, Ai, Aj, Lk, Ll) = (Ω+, Ω), Ω+ ∪ Ω ≠ ∅, ∄ a ∈ ℕ : i < a < j, ∄ b ∈ ℕ : k < b < l}
      Example: At what times and in which classification did Bob become an African wild cat?
    2. Semantic intermodal change materialization (MS-CM) returns a list of consecutive archive versions in which a given query Q produces different results between two ontology versions.
      Formally: S-CV(Q, A, Lk, Ll) = {(i, j) | i < j, S-DM(Q, Ai, Aj, Lk, Ll) = (Ω+, Ω), Ω+ ∪ Ω ≠ ∅, ∄ a ∈ ℕ : i < a < j, ∄ b ∈ ℕ : k < b < l}
      Example: At what times did Bob become an African wild cat between last year’s and today’s classification?
    3. Semantic interontological change materialization (MS-CM) returns a list of consecutive language versions in which a given query Q produces different results between two dataset versions.
      Formally: S-CV(Q, Ai, Aj, L) = {(k, l) | k < l, S-DM(Q, Ai, Aj, Lk, Ll) = (Ω+, Ω), Ω+ ∪ Ω ≠ ∅, ∄ a ∈ ℕ : i < a < j, ∄ b ∈ ℕ : k < b < l}
      Example: In which classification versions did Bob become an African wild cat between yesterday and today?

Query Atom Derivations

Based on the semantic query atom extensions that were introduced in last section, we can derive subtypes for the semantic delta materialisation and semantic cross-version join. These subtypes can be used as simplified form of the foundational semantic query atoms.

Semantic delta materialisation

The definition of semantic delta materialization (S-DM) is similar to, but more generic than the semantic diff definition by Völkel et al [2]. While the semantic diff only allows versioning on the dataset, our S-DM definition also enables versioning of the ontology. Similarly, Huang et al. [3] introduce a diff that enables versioning of the ontology, but not on the dataset. As such, our semantic delta materialization definition can be seen as a combination of both. Furthermore, we can express these diff methods in terms of S-DM as follows:

Semantic cross-version join

Similarly, we can define the following derivations of the semantic cross-version join:

Proof of Concept

In order to provide a baseline of the proposed semantic versioned querying atoms, we provide a prototypical implementation of a subset of the semantic versioned query atoms that were introduced in Section 4. In this section, we describe this system, followed by an evaluation description, and a presentation of the results.

System

Our prototype is implemented as an additional semantic layer on top of OSTRICH [9], which is a versioned RDF triple store that offers syntactical versioning. As OSTRICH only supports VM, DM and VQ triple pattern queries at the time of writing, we limit our implementation to the semantic extension of these atoms, namely, S-VM, S-DM and S-VQ triple pattern queries.

As can be seen in Fig. 1, our semantic layer internally uses two OSTRICH stores, one for the versioned dataset, and one for the versioned language.

As a backwards rule-based reasoner, this semantic layer infers additional triples for each S-VM, S-DM or S-VQ query. Rules can be provided in the Notation 3 syntax [24] at query time. It does this through the following steps:

  1. Select applicable rules for the given triple pattern.
  2. Query the dataset for the given triple pattern and version options.
  3. Loop until set of inferred triples stops growing.
    1. Determine rules that can infer new triples
    2. Perform backwards reasoning with these rules by querying the dataset and language for the given triple pattern and version options.
    3. Add newly inferred triples to set of triples
[architecture]

Fig. 1: Architecture overview of our semantic versioned querying layer on top of OSTRICH. The required parameters for the three triple pattern-based queries are indicated, with vd indicating the version of the dataset, and vl indicating the version of the language.

The source code of this prototype can be found on GitHub and is available under the MIT license.

Evaluation

In order to evaluate the performance of our semantic layer, we executed several queries with inferencing of rdfs:subClassOf relationships within the BEAR-B-daily dataset from the BEAR benchmark [6]. The BEAR benchmark evaluates using triple pattern queries, which form the basis of more expressive query evaluation.

To achieve this, we created a derived version of this BEAR-B-daily dataset where we removed all rdf:type relationships from instances to classes that can be inferred through rdfs:subClassOf relationships for each instance. The (44) triples identifying the subclass relationships were stored in a single version inside the language store. As the BEAR-B-daily dataset does not provide any versioning of the language, our evaluation excludes versioning of the language.

As OSTRICH only supports VM, DM and VQ triple pattern queries, we only evaluate their respective semantic extension, always using the single language version. For S-VM, we query the last version, for S-DM, we query between the first and last version, and for S-VQ, we do an intermodal query using the single language version.

The source code of this evaluation can be found on GitHub.

Results

The original BEAR-B-daily dataset contains 48,914 unique triples in 88 dataset versions, while the derived dataset contains 31,761 triples, which is a reduction of 35,07%.

Table 1, Table 2 and Table 3 respectively contain the evaluation results for the S-VM, S-DM and S-VQ queries. The table columns indicate the following:

The results show that the backwards reasoner within our prototype requires almost eight queries to the OSTRICH stores on average for this dataset. The queries to the OSTRICH stores form the main bottleneck.

Query Original Reduced Inferred Inference queries Inferred normalized
dbr:Palazzo_Parisio_(Valletta) 0.56 0.25 2.51 10 0.25
dbr:Singaporean_general_election,_2015 0.47 0.21 2.26 10 0.23
dbr:What_Do_You_Mean? 0.63 0.22 1.76 9 0.20
dbr:Dancing_with_the_Stars_… 0.58 0.23 1.55 6 0.26
dbr:Doctor_Who_(series_9) 0.33 0.15 0.97 6 0.16
dbr:My_Little_Pony… 0.15 0.09 0.94 7 0.13
dbr:2015 0.26 0.12 0.89 6 0.15
Average 0.42 0.18 1.55 7.71 0.20

Table 1: Execution times in milliseconds for S-VM queries against the last dataset version, using the first language version for an 7 S?? triple patterns.

Query Original Reduced Inferred Inference queries Inferred normalized
dbr:Palazzo_Parisio_(Valletta) 0.45 0.21 3.32 10 0.33
dbr:Singaporean_general_election,_2015 0.39 0.18 2.27 10 0.23
dbr:What_Do_You_Mean? 0.34 0.13 2.17 9 0.24
dbr:Dancing_with_the_Stars_… 0.40 0.20 1.06 6 0.18
dbr:Doctor_Who_(series_9) 0.18 0.12 1.14 6 0.19
dbr:My_Little_Pony… 0.12 0.11 2.47 7 0.35
dbr:2015 0.36 0.18 1.84 6 0.31
Average 0.32 0.16 2.04 7.71 0.26

Table 2: Execution times in milliseconds for S-DM queries between the first and last dataset versions, both using the first language version for 7 S?? triple patterns.

Query Original Reduced Inferred Inference queries Inferred normalized
dbr:Palazzo_Parisio_(Valletta) 0.65 0.29 2.48 10 0.25
dbr:Singaporean_general_election,_2015 0.83 0.28 2.16 10 0.22
dbr:What_Do_You_Mean? 0.58 0.17 1.24 9 0.14
dbr:Dancing_with_the_Stars_… 0.43 0.27 0.96 6 0.16
dbr:Doctor_Who_(series_9) 0.26 0.12 0.75 6 0.13
dbr:My_Little_Pony… 0.13 0.09 1.06 7 0.15
dbr:2015 0.24 0.10 1.17 6 0.20
Average 0.45 0.19 1.40 7.71 0.18

Table 3: Execution times in milliseconds for S-VQ queries for 7 S?? triple patterns.

Conclusions

In this work, we introduced fundamental concepts on how to evaluate semantic queries over versioned Linked Datasets. For this, we extended existing structural versioned query atoms by coupling them with language versioning and reasoning.

Our wrapper-based prototypical implementation of these semantic extensions shows that semantic querying, i.e., inference at runtime, is able to significantly reduce storage requirements at the cost of an increase in query time. Together with that, it also brings the additional benefit of being able to select the language version(s) for each query, which would otherwise require dataset duplication when no semantic querying layer is present.

As this is merely a wrapper-based prototype, inference is sub-optimal, and a lot of room for improvement exist. In future work, we foresee improvements regarding the runtime inference based on techniques from the world of OBDA. Furthermore, some RDF stream reasoning techniques could potentially be generalized to work for versioned querying. Native implementations of semantic versioning engines could also reduce the overhead of this wrapper-based approach.

The newly introduced foundational semantic versioned query atoms forms a basis for future research and development of semantic versioned querying within RDF stores, and will consequently enable enhanced analysis over multiple Linked Dataset versions.