ML Based Lineage in Databases
arXiv - CS - Databases Pub Date : 2021-09-13 , DOI: arxiv-2109.06339 Michael Leybovich, Oded Shmueli
In this work, we track the lineage of tuples throughout their database
lifetime. That is, we consider a scenario in which tuples (records) that are
produced by a query may affect other tuple insertions into the DB, as part of a
normal workflow. As time goes on, exact provenance explanations for such tuples
become deeply nested, increasingly consuming space, and resulting in decreased
clarity and readability. We present a novel approach for approximating lineage
tracking, using a Machine Learning (ML) and Natural Language Processing (NLP)
technique; namely, word embedding. The basic idea is to summarize (and
approximate) the lineage of each tuple via a small set of constant-size
vectors (the number of vectors per tuple is a hyperparameter). Therefore, our
solution does not suffer from space complexity blow-up over time, and it
"naturally ranks" explanations for the existence of a tuple. We devise an
alternative and improved lineage tracking mechanism, that of keeping track of
and querying lineage at the column level; thereby, we manage to better
distinguish between the provenance features and the textual characteristics of
a tuple. We integrate our lineage computations into the PostgreSQL system via
an extension (ProvSQL) and experimentally exhibit useful results in terms of
accuracy against exact, semiring-based justifications. In the experiments, we
focus on tuples with multiple generations of tuples in their lifelong lineage
and analyze them in terms of direct and distant lineage. The experiments
suggest that the proposed approximate lineage methods, and the further
suggested enhancements, have high potential usefulness. This especially holds
for the column-based vectors method, which exhibits high precision and high
per-level recall.
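The idea of a bounded set of constant-size lineage vectors per tuple can be illustrated with a small sketch. This is not the authors' algorithm: the seeded pseudo-random vectors stand in for real word embeddings, and the names `base_vector`, `derive`, `similarity`, and `rank_lineage`, the dimension `DIM`, the cap `MAX_VECTORS`, and the merge rule (averaging the two most similar vectors) are all assumptions made for illustration. It only shows how storage per tuple can stay fixed across generations while candidate ancestors are still ranked by similarity.

```python
import math
import random

DIM = 16          # assumed vector dimension
MAX_VECTORS = 2   # assumed cap on vectors per tuple (the hyperparameter)


def base_vector(tuple_id, dim=DIM):
    """Stand-in for a word-embedding vector: seeded pseudo-random values."""
    rng = random.Random(tuple_id)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def derive(parent_sets, max_vectors=MAX_VECTORS):
    """Build a new tuple's lineage vectors from its parents' vector sets."""
    pool = [v for vectors in parent_sets for v in vectors]
    # Keep the set at a fixed size by averaging the two most similar vectors.
    while len(pool) > max_vectors:
        pairs = [(i, j) for i in range(len(pool)) for j in range(i + 1, len(pool))]
        i, j = max(pairs, key=lambda p: cosine(pool[p[0]], pool[p[1]]))
        merged = [(x + y) / 2 for x, y in zip(pool[i], pool[j])]
        pool = [v for k, v in enumerate(pool) if k not in (i, j)] + [merged]
    return pool


def similarity(set_a, set_b):
    """Similarity of two vector sets: the best pairwise cosine similarity."""
    return max(cosine(a, b) for a in set_a for b in set_b)


def rank_lineage(target, candidates):
    """Order candidate tuple names from most to least likely ancestor."""
    return sorted(candidates,
                  key=lambda name: similarity(target, candidates[name]),
                  reverse=True)


# Hypothetical tuples: t1 and t2 produce `child`; `child` and t3 produce `grandchild`.
base = {name: [base_vector(name)] for name in ("t1", "t2", "t3", "t4")}
child = derive([base["t1"], base["t2"]])
grandchild = derive([child, base["t3"]])
print(rank_lineage(child, {k: base[k] for k in ("t1", "t2", "t4")}))
```

In this sketch `grandchild` still holds only `MAX_VECTORS` vectors of length `DIM`, however many generations precede it, which is the space property the abstract describes. The price is approximation: once vectors are merged, distant ancestors are recovered only by similarity, not exactly.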
Updated: 2021-09-15