Continuous outlier mining of streaming data in flink, Information Systems


Continuous outlier mining of streaming data in flink
Information Systems (IF 3.7), Pub Date: 2020-05-29, DOI: 10.1016/j.is.2020.101569
Theodoros Toliopoulos, Anastasios Gounaris, Kostas Tsichlas, Apostolos Papadopoulos, Sandra Sampaio

In this work, we focus on distance-based outliers in a metric space, where whether an entity is an outlier depends on the number of other entities in its neighborhood. In recent years, several solutions have tackled the problem of distance-based outliers in data streams, where outliers must be mined continuously as new elements become available. An interesting research problem is to combine the streaming environment with massively parallel systems to provide scalable stream-based algorithms. However, none of the previously proposed techniques targets a massively parallel setting. Our proposal fills this gap and investigates the challenges in transferring state-of-the-art techniques to Apache Flink, a modern platform for intensive streaming analytics. We thoroughly present the technical challenges encountered and the alternatives that may be applied, of which a micro-clustering-based one is the most efficient. We show speed-ups of up to 2.27 times over advanced non-parallel solutions, using just an ordinary four-core machine and a real-world dataset. When moving to a three-machine cluster, due to less contention, we achieve both better scalability in terms of the window slide size and the data dimensionality, and even higher speed-ups, e.g., by a factor of more than 11. Overall, our results demonstrate that outlier mining can be achieved in an efficient and scalable manner. The resulting techniques have been made publicly available as open-source software.
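To make the definition in the abstract concrete, the following is a minimal single-threaded sketch of distance-based outlier detection over a count-based sliding window. It is not the paper's method: it uses a naive all-pairs scan rather than micro-clustering, and it does not use Apache Flink. The names `radius`, `k`, `window_size`, and `slide` are assumptions chosen for illustration; a point is reported as an outlier when fewer than `k` other points in the current window lie within `radius` of it.

```python
from collections import deque
from math import dist  # Euclidean distance, Python 3.8+


def window_outliers(window, radius, k):
    """Return the points in `window` with fewer than `k` neighbors within `radius`.

    Naive O(n^2) scan; assumed here only to illustrate the definition.
    """
    points = list(window)
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points) if i != j and dist(p, q) <= radius
        )
        if neighbors < k:
            outliers.append(p)
    return outliers


def stream_outliers(stream, window_size, slide, radius, k):
    """Yield the outlier list each time the count-based window slides."""
    # deque with maxlen drops the oldest point when a new one arrives
    window = deque(maxlen=window_size)
    for n, point in enumerate(stream, start=1):
        window.append(point)
        # Report once the window is full, then after every `slide` new points
        if n >= window_size and (n - window_size) % slide == 0:
            yield window_outliers(window, radius, k)


# Hypothetical 2-D stream: a tight cluster near the origin plus one far point
stream = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0),
    (0.1, 0.1), (0.05, 0.05), (0.0, 0.05), (0.05, 0.0),
]
results = list(stream_outliers(stream, window_size=4, slide=1, radius=0.5, k=2))
```

Each element of `results` is the outlier set for one window position. The far point is reported while it remains in the window and disappears from the output once it expires, which is the continuous behaviour the abstract describes.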

Updated: 2020-05-29
