TextBenDS: a Generic Textual Data Benchmark for Distributed Systems
Information Systems Frontiers (IF 5.9), Pub Date: 2020-03-06, DOI: 10.1007/s10796-020-09999-y
Ciprian-Octavian Truică, Elena-Simona Apostol, Jérôme Darmont, Ira Assent

Extracting top-k keywords and documents using weighting schemes is a popular technique in text mining and machine learning for various analysis and retrieval tasks. The weights are usually computed in the data preprocessing step, because updating them to track every modification performed on the dataset is costly. Furthermore, calculation errors are introduced when only subsets of the dataset are analyzed: wrong weights are computed, because weighting schemes use the number of documents to score keywords and documents. Therefore, in a Big Data context, it is crucial to lower the runtime of computing weighting schemes without hindering the analysis process or the accuracy of the machine learning algorithms. To address this requirement for the task of computing top-k keywords and documents (which relies largely on weighting schemes), it is customary to design benchmarks that compare weighting schemes across various configurations of distributed frameworks and database management systems. Thus, we propose TextBenDS, a generic document-oriented benchmark for storing textual data and constructing weighting schemes. Our benchmark offers a generic data model, designed with a multidimensional approach, for storing text documents. We also propose using aggregation queries of various complexities and selectivities to construct term weighting schemes, which are then used to extract top-k keywords and documents. We evaluate the computing performance of the queries on several distributed environments set up within the Apache Hadoop ecosystem. Our experimental results provide interesting insights. For example, MongoDB shows the best overall performance, while Spark's execution time remains almost constant regardless of the weighting scheme.
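
The abstract's point that analyzing only a subset yields wrong weights can be illustrated with a small sketch. It assumes TF-IDF as the weighting scheme (the abstract does not name a specific scheme), and the function names `tfidf_weights` and `top_k` and the four-document corpus are hypothetical, chosen only for illustration. It is not the benchmark's implementation.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Sum each term's TF-IDF over all documents (TF-IDF is an assumed scheme)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = Counter()
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / df[term])  # depends on the document count
            weights[term] += tf * idf
    return weights

def top_k(weights, k):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    return sorted(weights.items(), key=lambda item: (-item[1], item[0]))[:k]

# Hypothetical toy corpus.
corpus = [
    "spark processes text data",
    "mongodb stores text data",
    "spark runs aggregation queries",
    "mongodb runs aggregation queries",
]

full_weights = tfidf_weights(corpus)        # weights over the whole dataset
subset_weights = tfidf_weights(corpus[:2])  # weights over a subset only

print(full_weights["text"], subset_weights["text"])
```

In the subset, "text" occurs in every document, so its inverse document frequency is log(2/2) = 0 and it can never rank as a top keyword, whereas over the full corpus its weight is positive. This is why weights computed on a subset cannot simply be reused for the whole dataset.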


Updated: 2020-04-21