Tracing outliers in the dataset of Drosophila suzukii records with the Isolation Forest method
Journal of Big Data (IF 8.1), Pub Date: 2020-03-05, DOI: 10.1186/s40537-020-00288-8
Ugo Santosuosso, Alessandro Cini, Alessio Papini

The analysis of big data is a fundamental challenge for the current and future streams of data coming from many different sources. Geospatial data is among the less investigated of these sources. A typical example of a continually growing dataset is the distribution data of invasive species in the territories they affect. The dataset of Drosophila suzukii invasion sites in Europe up to 2011 was used to test a possible method to pinpoint its outliers (anomalies). Our aim was to find a method of analysis able to handle large amounts of data and produce easily readable outputs that summarize the status of a biological invasion and, possibly, predict its future development. To do that, we identified the so-called anomalies of the dataset with a Python script based on the machine learning algorithm “Isolation Forest”. We also used the K-Means clustering method to partition the dataset. In our test, based on a real dataset, the Silhouette method indicated 10 as the best number of clusters. The clusters were drawn on the map with a Voronoi tessellation, showing that 8 clusters were centered on industrial harbours, while the remaining two were in the hinterland. This led us to hypothesize that: (1) the main entry mechanism into Europe may be the import of goods through ports, apparently occurring several times; (2) the spread inland may be due to road transportation of goods; (3) the outliers (anomalies) found with the Isolation Forest method identify individuals or populations that tend to detach from their original cluster and hence indicate the lines along which the invasion may spread further. This type of analysis thus aims to identify the future direction of an invasion, rather than the center of origin as in the case of geographic profiling. Isolation Forest therefore provides results complementary to those of PGP.
Recent records of the invasive species, mainly located close to the outlier positions, indicate that the Isolation Forest method can be considered predictive and is a useful method for treating large geospatial datasets.
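
The abstract gives no implementation details beyond "a Python script based on Isolation Forest", so the following is only a minimal from-scratch sketch of the general technique, not the authors' script. It builds random isolation trees over 2-D points and scores each point by how quickly it gets isolated (scores near 1 suggest an anomaly). The coordinates are synthetic, and every name here (`make_demo_records`, `anomaly_scores`, the cluster centres, the parameter defaults) is a hypothetical choice made for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random coordinate at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split on this coordinate, stop here
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(point, node, depth=0):
    """Depth at which the point lands, plus a correction for unsplit leaves."""
    if "split" not in node:
        return depth + c_factor(node["size"])
    child = node["left"] if point[node["dim"]] < node["split"] else node["right"]
    return path_length(point, child, depth + 1)


def anomaly_scores(points, n_trees=100, sample_size=256, seed=0):
    """Return one score in (0, 1] per point; higher means easier to isolate."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    norm = c_factor(sample_size)
    scores = []
    for p in points:
        mean_depth = sum(path_length(p, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores


def make_demo_records(seed=0):
    """Synthetic (lon, lat) records: two tight clusters plus one distant record (last)."""
    rng = random.Random(seed)
    records = [(rng.gauss(9.0, 0.2), rng.gauss(44.4, 0.2)) for _ in range(60)]
    records += [(rng.gauss(2.2, 0.2), rng.gauss(41.4, 0.2)) for _ in range(60)]
    records.append((20.0, 55.0))  # hypothetical record far from both clusters
    return records


if __name__ == "__main__":
    records = make_demo_records()
    scores = anomaly_scores(records)
    top = max(range(len(records)), key=scores.__getitem__)
    print("most anomalous record:", records[top], "score:", round(scores[top], 3))
```

In practice a library implementation such as scikit-learn's `sklearn.ensemble.IsolationForest` would normally be preferred over hand-written trees. The sketch is meant only to show why a record far from every cluster is isolated after very few random splits.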




Updated: 2020-04-21