Challenges in benchmarking stream learning algorithms with real-world data,Data Mining and Knowledge Discovery



Data Mining and Knowledge Discovery (IF 4.8), Pub Date: 2020-07-07, DOI: 10.1007/s10618-020-00698-5
Vinicius M. A. Souza, Denis M. dos Reis, André G. Maletzke, Gustavo E. A. P. A. Batista

Streaming data are increasingly present in real-world applications such as sensor measurements, satellite data feeds, stock markets, and financial data. The main characteristics of these applications are the online arrival of data observations at high speed and the susceptibility to changes in the data distributions due to the dynamic nature of real environments. The data stream mining community still faces primary challenges related to the comparison and evaluation of new proposals, mainly due to the lack of publicly available, high-quality, non-stationary real-world datasets. Comparing the stream algorithms proposed in the literature is not an easy task, as authors do not always follow the same recommendations, experimental evaluation procedures, datasets, and assumptions. In this paper, we mitigate problems related to the choice of datasets in the experimental evaluation of stream classifiers and drift detectors. To that end, we propose a new public data repository for benchmarking stream algorithms with real-world data. This repository contains the most popular datasets from the literature and new datasets related to a highly relevant public health problem that involves the recognition of disease vector insects using optical sensors. The main advantage of these new datasets is that their characteristics and patterns of change are known in advance, which allows new adaptive algorithms to be evaluated adequately. We also present an in-depth discussion of the characteristics, reasons, and issues that lead to different types of changes in data distribution, as well as a critical review of common problems concerning the current benchmark datasets available in the literature.


Updated: 2020-07-07
