Task-agnostic Indexes for Deep Learning-based Queries over Unstructured Data
arXiv - CS - Databases Pub Date : 2020-09-09 , DOI: arxiv-2009.04540
Daniel Kang, John Guibas, Peter Bailis, Tatsunori Hashimoto, Matei Zaharia

Unstructured data is now commonly queried by using target deep neural networks (DNNs) to produce structured information, e.g., object types and positions in video. As these target DNNs can be computationally expensive, recent work uses proxy models to produce query-specific proxy scores. These proxy scores are then used in downstream query processing algorithms to speed up query execution. Unfortunately, proxy models are often trained per query, require large amounts of training data from the target DNN, and need new training methods for each query type. In this work, we develop an index construction method (task-agnostic semantic trainable index, TASTI) that produces reusable embeddings that can be used to generate proxy scores for a wide range of queries, removing the need for query-specific proxies. We observe that many queries over the same dataset only require access to the schema induced by the target DNN. For example, an aggregation query counting the number of cars and a selection query selecting frames of cars both require only the object types per frame of video. To leverage this opportunity, TASTI produces embeddings per record that have the key property that close embeddings have similar extracted attributes under the induced schema. Given this property, we show that clustering by embeddings can be used to answer downstream queries efficiently. We theoretically analyze TASTI and show that low training error guarantees downstream query accuracy for a natural class of queries. We evaluate TASTI on four video and text datasets and three query types. We show that TASTI can be 10x less expensive to construct than proxy models and can outperform them by up to 24x at query time.
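
The idea of clustering by embeddings can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how clusters are formed or how scores are propagated, so the greedy farthest-point selection of representatives and the nearest-representative labeling below are assumptions, as are all names (`farthest_point_sample`, `proxy_scores`, `target_dnn`) and the toy 2-D embeddings. The sketch runs the (stand-in) target DNN only on the representatives, then reuses the resulting labels to answer both an aggregation query and a selection query.

```python
import math
import random


def farthest_point_sample(embeddings, k, seed=0):
    """Pick k representative indices by greedy farthest-point sampling.

    Assumption: the abstract only says "clustering by embeddings"; this
    particular selection rule is chosen here for illustration.
    """
    rng = random.Random(seed)
    first = rng.randrange(len(embeddings))
    reps = [first]
    # Distance from every record to its closest representative so far.
    dist = [math.dist(e, embeddings[first]) for e in embeddings]
    while len(reps) < k:
        nxt = max(range(len(embeddings)), key=dist.__getitem__)
        reps.append(nxt)
        dist = [min(d, math.dist(e, embeddings[nxt]))
                for d, e in zip(dist, embeddings)]
    return reps


def proxy_scores(embeddings, reps, rep_labels):
    """Give each record the label of its nearest representative."""
    scores = []
    for e in embeddings:
        nearest = min(range(len(reps)),
                      key=lambda j: math.dist(e, embeddings[reps[j]]))
        scores.append(rep_labels[nearest])
    return scores


# Toy data (hypothetical): close embeddings share the same car count.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2),
              (10.0, 10.0), (10.1, 9.9), (9.8, 10.0)]
true_counts = [0, 0, 0, 2, 2, 2]


def target_dnn(i):
    """Stand-in for the expensive target DNN: cars in record i."""
    return true_counts[i]


reps = farthest_point_sample(embeddings, k=2)
rep_labels = [target_dnn(i) for i in reps]  # only 2 expensive calls
proxy = proxy_scores(embeddings, reps, rep_labels)

estimated_total = sum(proxy)                             # aggregation query
selected = [i for i, s in enumerate(proxy) if s > 0]     # selection query
print(estimated_total, selected)
```

The same `proxy` list serves both queries, which mirrors the point that many queries need only the schema induced by the target DNN. On real data the proxy scores would be approximate, and a downstream query processing algorithm would still be needed to bound the error.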

Updated: 2020-09-11