Addressing Big Data Time Series: Mining Trillions of Time Series Subsequences Under Dynamic Time Warping.
ACM Transactions on Knowledge Discovery from Data (IF 3.6), Pub Date: 2013-09-01
Thanawin Rakthanmanon 1, Bilson Campana 2, Abdullah Mueen 2, Gustavo Batista 3, Brandon Westover 4, Qiang Zhu 2, Jesin Zakaria 2, Eamonn Keogh 2

Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms, including classification, clustering, motif discovery, anomaly detection, and so on. The difficulty of scaling a search to large datasets explains to a great extent why most academic work on time series data mining has plateaued at considering a few million time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine massive time series for the first time. We demonstrate the following unintuitive fact: in large datasets we can exactly search under Dynamic Time Warping (DTW) much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We explain how our ideas allow us to solve higher-level time series data mining problems such as motif discovery and clustering at scales that would otherwise be untenable. Moreover, we show how our ideas allow us to efficiently support the uniform scaling distance measure, a measure whose utility seems to be underappreciated, but which we demonstrate here. In addition to mining massive datasets with up to one trillion datapoints, we will show that our ideas also have implications for real-time monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower-powered devices than are currently possible.
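
For readers unfamiliar with the search problem the abstract refers to, the following is a minimal sketch of exact subsequence similarity search under DTW. It is a naive baseline, not the authors' method: the abstract does not enumerate the four ideas, so none of them is reproduced here. The function names (`znorm`, `dtw`, `best_match`), the per-subsequence z-normalization, and the `window` parameter (a Sakoe-Chiba style warping band) are assumptions made for illustration only.

```python
import math


def znorm(x):
    """Z-normalize a sequence; a constant sequence maps to all zeros."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0:
        return [0.0] * n
    return [(v - mean) / std for v in x]


def dtw(a, b, window):
    """DTW distance between equal-length sequences.

    Uses squared point costs and a band of half-width `window`
    (assumed constraint); window=0 reduces to Euclidean distance.
    """
    n = len(a)
    assert len(b) == n, "this sketch assumes equal-length inputs"
    inf = math.inf
    prev = [inf] * (n + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        cur = [inf] * (n + 1)
        for j in range(max(1, i - window), min(n, i + window) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            cur[j] = cost + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return math.sqrt(prev[n])


def best_match(series, query, window):
    """Brute-force scan: return (start index, distance) of the closest subsequence."""
    m = len(query)
    q = znorm(query)
    best_pos, best_dist = -1, math.inf
    for start in range(len(series) - m + 1):
        candidate = znorm(series[start:start + m])
        d = dtw(q, candidate, window)
        if d < best_dist:
            best_pos, best_dist = start, d
    return best_pos, best_dist


if __name__ == "__main__":
    series = [0, 0, 0, 1, 3, 1, 0, 0, 5, 5]
    query = [2, 6, 2]  # same shape as [1, 3, 1] after z-normalization
    print(best_match(series, query, window=1))
```

This scan computes a full DTW for every candidate subsequence, so its cost grows with the series length times the square of the query length (or times query length and band width). That cost is the bottleneck the abstract describes, and it is what makes search at the trillion-datapoint scale impractical without further speedups.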

Updated: 2019-11-01