gStore 之 VS*-Tree

gStore:

gStore是由北京大学计算机所数据管理实验室研发面向RDF知识图谱的开源图数据库系统(通常称为Triple Store)。不同于传统基于关系数据库的知识图谱数据管理方法,gStore是一种原生基于图数据模型(Native Graph Model)的RDF数据管理系统,维持了原始RDF知识图谱的图结构;其数据模型是有标签、有向的多边图,每个顶点对应着一个主体或客体。我们将面向RDF的SPARQL查询,转换为面向RDF图的子图匹配查询,利用所提出的基于图结构的索引(VS*-tree)来加速查询的性能。

gStore 拥有以下特性:

  1. gStore从图数据库角度存储和检索RDF知识图谱数据;
  2. gStore支持W3C定义的SPARQL 1.1标准,包括含有Union,OPTIONAL,FILTER和聚集函数的查询;gStore支持有效的增删改操作;
  3. gStore单机可以支持1Billion(十亿)三元组规模的RDF知识图谱的数据管理任务;
我们组准备将用户商品评论数据使用gStore存储,在此基础上开发出适合用于检索用户特性的功能,这篇博客是对gStore中用于加速查询效率的VS*-tree树进行介绍,内容来自北大邹磊教授2014 VLDB论文《gStore:a graph-based SPARQL query engine》
(博客虽然大多英文,但稍有总结概述)

VS*-tree:

VS*-tree is 是一个作用于G*上的索引结构,用于高效地完成SPARQL查询,即找到Q*在G*上的matches;

Background:

Definition of match of Q* over G*:Q*= {v1… vn},A set of n distinct vertices {u1… un} in G*;

gStore 之 VS*-Tree

Solution:

There is a straightforward method consisting of two steps,its first step is a classical inclusion query and the second step is a multiway join process;
Problem need to be solved:reduce the search space for inclusion query
S-tree is proposed:
  1. A height-balanced tree similar to B+ tree;
  2. Each intermediate node is formed by ORing all child signatures in S-tree;
    gStore 之 VS*-Tree
Problem need to be solved: S-tree cannot support the second step, which is a NP-hard;
VS*-tree(based on S-tree) is proposed:
VS*-tree is a multi-resolution summary graph based on S-tree that can be used to reduce the search space of subgraph query processing;
Properties:
  1. Each path from the root to any leaf node has the same length h, which is the height of the VS∗-tree;
  2. Each leaf node corresponds to a vertex in G∗ and has a pointer to it.
  3. … …
    … …
    gStore 之 VS*-Tree
The example of building “super-edge”:

gStore 之 VS*-Tree

Definition of summary match of Q* over G^I:gStore 之 VS*-Tree

Function:

Summary matches can be used to reduce the search space for subgraph search over G*;
Problem need to be solved:

The vertex encoding strategy may lead to some vertex signatures having too many 1,these
vertices can match any query vertex signature;

Solution:

We partition all of u’s neighbors into n groups {g1,…, gn} and each group gi corresponds to one instance of vertex u, denoted as u[i];
gStore 之 VS*-Tree
gStore 之 VS*-Tree

Updates to VS*-tree:

1. Inseration:

The process begins at the root of VS*-tree and iteratively chooses a child node until it reaches a suitable leaf node location where u is inserted;
Then, the summary graph at the leaf level is updated and new super edge may need to be indtoduced;
The change of leaf signature and leaf summary graph must be propagated upwards within the VS*-tree;

Rule:

The location of new node depend on both node signature & G*’s signature;

gStore 之 VS*-Tree

After inserting, the signature of that leaf node and super edges adjacent to it may be updated, and the change must be propagated upwards within the VS*-tree;

2. Split:

Like other height balanced trees, insertion into a node that is already full will invoke a node split;

Rule:

At each iteration, we select as the seed nodes two entities from among the B + 1. We allocate the remaining entities to these two nodes according to Equation 1.
Then, we have to update the signatures & the super edges associated with the two new nodes, and the change will be also propagated to the root of the VS∗-tree;

3. Deleation

Rule:

To delete a vertex u from VS*-tree, we find the leaf node d where u is stored, and delete u, and then adopt a bottom-up strategy to update the signature of and super edges associated with the nodes;
If some node d has less than b entries, then d is deleted and its entries are reinserted into VS*-tree;