Similarity Driven Approximation for Text Analytics
arXiv - CS - Databases Pub Date : 2019-10-16 , DOI: arxiv-1910.07144
Guangyan Hu, Yongfeng Zhang, Sandro Rigo, Thu D. Nguyen

Text analytics has become an important part of business intelligence as enterprises increasingly seek to extract insights for decision making from text data sets. Processing large text data sets can be computationally expensive, however, especially if it involves sophisticated algorithms. This challenge is exacerbated when it is desirable to run different types of queries against a data set, making it expensive to build multiple indices to speed up query processing. In this paper, we propose and evaluate a framework called EmApprox that uses approximation to speed up the processing of a wide range of queries over large text data sets. The key insight is that different types of queries can be approximated by processing subsets of data that are most similar to the queries. EmApprox builds a general index for a data set by learning a natural language processing model, producing a set of highly compressed vectors representing words and subcollections of documents. Then, at query processing time, EmApprox uses the index to guide sampling of the data set, with the probability of selecting each subcollection of documents being proportional to its similarity to the query as computed using the vector representations. We have implemented a prototype of EmApprox as an extension of the Apache Spark system, and used it to approximate three types of queries: aggregation, information retrieval, and recommendation. Experimental results show that EmApprox's similarity-guided sampling achieves much better accuracy than random sampling. Further, EmApprox can achieve significant speedups if users can tolerate a small amount of inaccuracy. For example, when sampling at 10%, EmApprox speeds up a set of queries counting phrase occurrences by almost 10x while achieving estimated relative errors of less than 22% for 90% of the queries.
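
The sampling step described in the abstract, selecting each subcollection with probability proportional to its similarity to the query, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say which similarity measure or estimator EmApprox uses, so cosine similarity, the small probability floor, and the Hansen-Hurwitz estimator for with-replacement sampling are all assumptions made here for illustration, as are the function names.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def selection_probabilities(query_vec, subcollection_vecs, floor=1e-6):
    """Probability of picking each subcollection, proportional to its similarity.

    Assumption: similarities at or below zero are clipped to a small floor so
    that every subcollection keeps a nonzero chance of being sampled.
    """
    sims = [max(cosine(query_vec, v), floor) for v in subcollection_vecs]
    total = sum(sims)
    return [s / total for s in sims]


def sample_subcollections(probs, n, rng):
    """Draw n subcollection indices with replacement, weighted by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=n)


def estimate_total(sampled_ids, probs, count_in):
    """Hansen-Hurwitz estimate of a total (assumed estimator, not from the paper).

    count_in(i) is a hypothetical callback returning the exact value of the
    aggregate (e.g. phrase occurrences) inside subcollection i. Each sampled
    value is weighted by 1 / p_i to correct for unequal selection odds.
    """
    return sum(count_in(i) / probs[i] for i in sampled_ids) / len(sampled_ids)
```

In use, a query vector and the per-subcollection vectors would come from the learned index; only the sampled subcollections are then scanned, and dividing each result by its selection probability keeps the estimate of the aggregate unbiased even though similar subcollections are heavily favored.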

Updated: 2020-01-14