Similarity Driven Approximation for Text Analytics
arXiv - CS - Databases, Pub Date: 2019-10-16, DOI: arxiv-1910.07144
Authors: Guangyan Hu, Yongfeng Zhang, Sandro Rigo, Thu D. Nguyen
Text analytics has become an important part of business intelligence as
enterprises increasingly seek to extract insights for decision making from text
data sets. Processing large text data sets can be computationally expensive,
however, especially if it involves sophisticated algorithms. This challenge is
exacerbated when it is desirable to run different types of queries against a
data set, making it expensive to build multiple indices to speed up query
processing. In this paper, we propose and evaluate a framework called EmApprox
that uses approximation to speed up the processing of a wide range of queries
over large text data sets. The key insight is that different types of queries
can be approximated by processing subsets of data that are most similar to the
queries. EmApprox builds a general index for a data set by learning a natural
language processing model, producing a set of highly compressed vectors
representing words and subcollections of documents. Then, at query processing
time, EmApprox uses the index to guide sampling of the data set, with the
probability of selecting each subcollection of documents being proportional to
its similarity to the query as computed using the vector representations.
We have implemented a prototype of EmApprox as an extension of the Apache Spark
system, and used it to approximate three types of queries: aggregation,
information retrieval, and recommendation. Experimental results show that
EmApprox's similarity-guided sampling achieves much better accuracy than random
sampling. Further, EmApprox can achieve significant speedups if users can
tolerate a small amount of inaccuracy. For example, when sampling at 10%,
EmApprox speeds up a set of queries counting phrase occurrences by almost 10x
while achieving estimated relative errors of less than 22% for 90% of the
queries.
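
The similarity-guided sampling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state which similarity measure or estimator EmApprox uses, so cosine similarity, the small probability floor, and the Hansen-Hurwitz estimator (a standard choice for sampling with replacement under unequal probabilities) are all assumptions made here for illustration. The function names and the `count_fn` callback are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sampling_probabilities(query_vec, subcollection_vecs, floor=1e-6):
    """Selection probability of each subcollection, proportional to its similarity.

    Assumption: similarities at or below zero are clipped to a small floor so
    that every subcollection keeps a nonzero chance of being selected.
    """
    weights = [max(cosine(query_vec, v), floor) for v in subcollection_vecs]
    total = sum(weights)
    return [w / total for w in weights]


def estimate_total(count_fn, probs, n_draws, rng):
    """Hansen-Hurwitz estimate of a corpus-wide total (e.g. phrase occurrences).

    count_fn(i) is a hypothetical callback that runs the exact query on
    subcollection i only; it is called just for the sampled subcollections.
    """
    drawn = rng.choices(range(len(probs)), weights=probs, k=n_draws)
    return sum(count_fn(i) / probs[i] for i in drawn) / n_draws


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for the learned representations.
    query = [1.0, 0.0]
    subcollections = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
    true_counts = [50, 30, 1]  # made-up per-subcollection counts

    probs = sampling_probabilities(query, subcollections)
    estimate = estimate_total(lambda i: true_counts[i], probs, 2, random.Random(0))
    print(probs, estimate)
```

Dividing each sampled count by its selection probability keeps the estimate unbiased even though similar subcollections are drawn more often; the variance is low when the quantity being counted is roughly proportional to the similarity score.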
Updated: 2020-01-14