Anomaly Detection and Repair for Accurate Predictions in Geo-distributed Big Data
Big Data Research (IF 3.3), Pub Date: 2019-04-17, DOI: 10.1016/j.bdr.2019.04.001
Roberto Corizzo, Michelangelo Ceci, Nathalie Japkowicz

The growing presence of geo-distributed sensor networks means that huge volumes of data are generated at multiple geographical locations at an increasing rate. This raises important issues, which become more challenging when the final goal is to analyze the data for forecasting or, more generally, for predictive tasks. This paper proposes a framework that supports predictive modeling tasks on streaming data coming from multiple geo-referenced sensors. In particular, we propose a distance-based anomaly detection strategy that considers objects described by embedding features learned via a stacked auto-encoder. We then devise a repair strategy that repairs the data detected as anomalous by exploiting non-anomalous data measured by sensors in nearby spatial locations. Subsequently, we adopt Gradient Boosted Trees (GBTs) to predict or forecast the values of a target variable of interest for the repaired, newly arriving (unlabeled) data, using either the original feature representation or the embedding feature representation learned via the stacked auto-encoder. The workflow is implemented with distributed Apache Spark programming primitives and tested in a cluster environment. We perform experiments to assess the performance of each module, separately and in combination, on the predictive modeling of one-day-ahead energy production for multiple renewable energy sites. Accuracy results show that the proposed framework reduces the error by up to 13.56%. Moreover, scalability results demonstrate the efficiency of the proposed framework in terms of speedup, scaleup and execution time under a stress test.
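
The detect-then-repair idea described in the abstract can be illustrated with a minimal, single-machine sketch. It is not the authors' implementation: the abstract does not specify the distance measure, the anomaly threshold, or the repair formula, so the k-nearest-neighbour scoring, the fixed threshold, and the inverse-distance-weighted repair below are all assumptions made for illustration. The function names (knn_score, detect_anomalies, repair) and the sample values are hypothetical, the embeddings are taken as given rather than learned by a stacked auto-encoder, and nothing here uses Apache Spark.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_score(point, reference, k):
    """Mean distance from `point` to its k nearest reference embeddings (assumed scoring rule)."""
    dists = sorted(euclidean(point, r) for r in reference)
    return sum(dists[:k]) / k


def detect_anomalies(embeddings, reference, k, threshold):
    """Flag an embedding as anomalous when its k-NN score exceeds `threshold` (assumed rule)."""
    return [knn_score(e, reference, k) > threshold for e in embeddings]


def repair(values, coords, is_anomalous, radius):
    """Replace each anomalous reading with the inverse-distance-weighted mean
    of non-anomalous readings from sensors within `radius` (assumed repair rule).
    A reading with no usable neighbour is left unchanged."""
    repaired = list(values)
    for i, flagged in enumerate(is_anomalous):
        if not flagged:
            continue
        num = den = 0.0
        for j, other_flagged in enumerate(is_anomalous):
            if j == i or other_flagged:
                continue
            d = euclidean(coords[i], coords[j])
            if 0 < d <= radius:
                num += values[j] / d
                den += 1.0 / d
        if den > 0:
            repaired[i] = num / den
    return repaired


# Hypothetical embeddings of historical (assumed normal) observations.
reference = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
# Hypothetical embeddings of newly arriving observations, one per sensor.
incoming = [(0.5, 0.5), (10.0, 10.0), (0.2, 0.8)]
flags = detect_anomalies(incoming, reference, k=2, threshold=2.0)

# Hypothetical raw readings and sensor positions for the same three sensors.
readings = [10.0, 500.0, 14.0]
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
fixed = repair(readings, positions, flags, radius=1.5)

print(flags)  # [False, True, False]
print(fixed)  # [10.0, 12.0, 14.0]
```

In this toy run the second sensor's embedding lies far from the reference set, so its reading is replaced by a weighted mean of its two non-anomalous neighbours. The repaired readings would then be passed to a regressor such as Gradient Boosted Trees, which is outside the scope of this sketch.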

Updated: 2019-04-17