Performance Evaluation of the MapReduce-based Parallel Data Preprocessing Algorithm in Web Usage Mining with Robot Detection Approaches, IETE Technical Review


IETE Technical Review (IF 2.4), Pub Date: 2021-04-28, DOI: 10.1080/02564602.2021.1918584
Mitali Srivastava₁, Atul Kumar Srivastava₁, Rakhi Garg₂, P. K. Mishra₃


Data preprocessing is an essential task for preparing target datasets suitable for statistical and data mining algorithms. It has become one of the more complex segments of Web usage mining because Web server logs are massive and unstructured. The data preprocessing segment of Web usage mining is divided into several phases, such as data fusion, data cleaning, user identification, session identification, path completion, and data formatting. This paper focuses on the initial phases of the Web usage mining process, such as data cleaning, user identification, and session identification. As log data grows to terabyte and petabyte scale, traditional data preprocessing algorithms fail to scale and run into Big Data issues. Over the past few years, the MapReduce framework has evolved into one of the most widely used parallel programming frameworks for processing Big Data on a cluster of nodes. In this paper, a MapReduce-based data preprocessing algorithm is developed. The algorithm comprises data preprocessing subphases such as data cleaning, user identification, and session identification. Several efficient heuristics are incorporated into an existing MapReduce-based data preprocessing algorithm to detect ethical and unethical robots. Experiments performed on a cluster of nodes show that the proposed MapReduce-based data preprocessing algorithm is efficient and scalable for larger datasets. Moreover, we analyze the impact of robots' requests on the sessions generated in the session identification phase to measure the effectiveness of the proposed approach.
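The abstract does not give the algorithm's details, so the following is only a minimal single-machine sketch of how the three subphases might map onto a map step and a reduce step. Everything specific in it is an assumption made for illustration, not the paper's method: the Combined Log Format input, identifying a user by the (IP, user agent) pair, the 30-minute session timeout, the static-file extension list, and the simple robot check (a request for /robots.txt or a "bot"-like user agent, which would catch only self-declaring robots). The names map_record, reduce_user, and preprocess are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed input: Apache/Nginx Combined Log Format.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)
STATIC_EXT = (".css", ".js", ".png", ".jpg", ".gif", ".ico")  # assumed list
BOT_HINTS = ("bot", "crawler", "spider")                      # assumed keywords
SESSION_TIMEOUT = timedelta(minutes=30)                       # assumed heuristic


def map_record(line):
    """Map step: parse and clean one log line.

    Returns ((ip, agent), (timestamp, url)) or None if the line is dropped.
    """
    m = LOG_RE.match(line)
    if m is None:
        return None
    url = m["url"]
    # Keep /robots.txt requests: the reduce step uses them as a robot signal.
    if url != "/robots.txt":
        if m["method"] != "GET" or not m["status"].startswith("2"):
            return None
        if url.lower().split("?")[0].endswith(STATIC_EXT):
            return None
    ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return (m["ip"], m["agent"]), (ts, url)


def reduce_user(key, hits):
    """Reduce step: flag robots and split one user's hits into sessions."""
    _, agent = key
    hits = sorted(hits)
    is_robot = (
        any(hint in agent.lower() for hint in BOT_HINTS)
        or any(url == "/robots.txt" for _, url in hits)
    )
    sessions, current = [], []
    for ts, url in hits:
        # Start a new session when the gap to the previous hit is too long.
        if current and ts - current[-1][0] > SESSION_TIMEOUT:
            sessions.append([u for _, u in current])
            current = []
        current.append((ts, url))
    if current:
        sessions.append([u for _, u in current])
    return {"user": key, "robot": is_robot, "sessions": sessions}


def preprocess(lines):
    """Local stand-in for the shuffle phase: group map output by key."""
    groups = defaultdict(list)
    for line in lines:
        kv = map_record(line)
        if kv is not None:
            groups[kv[0]].append(kv[1])
    return [reduce_user(key, hits) for key, hits in groups.items()]
```

On a real cluster the grouping in preprocess would be done by the framework's shuffle, and the impact of robot traffic on sessions could be estimated by comparing session counts with and without the records flagged as robots.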


Last updated: 2021-04-28
