Recurring concept memory management in data streams: exploiting data stream concept evolution to improve performance and transparency
Data Mining and Knowledge Discovery (IF 4.8), Pub Date: 2021-02-14, DOI: 10.1007/s10618-021-00736-w
Ben Halstead, Yun Sing Koh, Patricia Riddle, Russel Pears, Mykola Pechenizkiy, Albert Bifet

A data stream is a sequence of observations produced by a generating process which may evolve over time. In such a time-varying stream the relationship between input features and labels, known as the concept, can change. Adapting to changes in concept is most often done by destroying and incrementally rebuilding the current classifier. Many systems additionally store and reuse previously built models to adapt more efficiently when stream conditions drift to a previously seen state. Reusing a model offers increased classification performance over rebuilding, and provides an indicator of, or transparency into, the hidden state of the generating process. When only a subset of past models can be stored for reuse, for example due to memory constraints, the choice of which models to store for optimal future reuse is an important problem. Current methods of evaluating which models to store use valuation policies such as age, time since last use, accuracy and diversity. These policies are often not optimal, losing predictive performance by undervaluing complex models. We propose a new valuation policy based on advantage, the misclassifications avoided by reusing a model rather than training a new model, which more accurately reflects the true value of model storage. We evaluate our method on synthetic and real-world data, including a real-world air pollution dataset. Our results show accuracy increases of up to 6% using our valuation policy, while preserving transparency.
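
The advantage idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class names `StoredModel` and `ModelRepository`, the error counters, and the "evict the lowest advantage" rule are all assumptions made for illustration. The sketch only shows how a memory-limited store might rank models by the misclassifications they avoided compared with a freshly trained model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredModel:
    """Hypothetical bookkeeping record for one stored classifier."""
    name: str
    reuse_errors: int = 0  # misclassifications made while this model was reused
    fresh_errors: int = 0  # misclassifications a newly trained model made on the same observations

    def record(self, reused_correct: bool, fresh_correct: bool) -> None:
        """Update both error counters for a single observation."""
        self.reuse_errors += int(not reused_correct)
        self.fresh_errors += int(not fresh_correct)

    @property
    def advantage(self) -> int:
        """Misclassifications avoided by reusing this model instead of retraining."""
        return self.fresh_errors - self.reuse_errors


class ModelRepository:
    """Memory-limited model store (assumed design): evicts the lowest-advantage model."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.models = {}  # maps model name -> StoredModel

    def store(self, model: StoredModel) -> Optional[StoredModel]:
        """Add a model; return the evicted model if capacity was exceeded, else None."""
        self.models[model.name] = model
        if len(self.models) <= self.capacity:
            return None
        victim = min(self.models.values(), key=lambda m: m.advantage)
        del self.models[victim.name]
        return victim
```

In this sketch a model that is old or rarely used is still kept as long as reusing it avoided more errors than its competitors, which is the contrast the abstract draws with age-based and recency-based policies.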




Updated: 2021-02-15