Unsupervised and scalable subsequence anomaly detection in large data series

Boniol, Paul; Linardi, Michele; Roncallo, Federico; Palpanas, Themis; Meftah, Mohammed; Remy, Emmanuel

doi:10.1007/s00778-021-00655-8

Regular Paper
Published: 03 March 2021

Volume 30, pages 909–931, (2021)

The VLDB Journal

Paul Boniol¹ (ORCID: orcid.org/0000-0001-8516-0123),
Michele Linardi²,
Federico Roncallo²,
Themis Palpanas²,
Mohammed Meftah¹ &
Emmanuel Remy¹

A Correction to this article was published on 31 August 2021

This article has been updated

Abstract

Subsequence anomaly (or outlier) detection in long sequences is an important problem with applications in a wide range of domains. However, the approaches proposed so far in the literature have severe limitations: they either require prior domain knowledge or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. In this work, we address these problems and propose NormA, a novel approach suitable for domain-agnostic anomaly detection. NormA is based on a new data series primitive, which makes it possible to detect anomalies based on their (dis)similarity to a model that represents normal behavior. The experimental results on several real datasets demonstrate that the proposed approach correctly identifies all single and recurrent anomalies of various types, with no prior knowledge of the characteristics of these anomalies (except for their length). Moreover, it outperforms the current state-of-the-art algorithms by a large margin in terms of accuracy, while being orders of magnitude faster.
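
The idea of scoring subsequences by their dissimilarity to a model of normal behavior can be illustrated with a minimal sketch. This is not the NormA algorithm itself: the abstract does not describe how the normal model is built, so the sketch assumes the model is simply handed over as a set of weighted patterns, each longer than the subsequence length, and that dissimilarity is the z-normalized Euclidean distance minimized over all alignments. The function names (`znorm`, `min_aligned_distance`, `anomaly_scores`) and the toy data are hypothetical.

```python
import numpy as np

def znorm(x):
    """Z-normalize a 1-D array; constant arrays map to all zeros."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    if std == 0:
        return np.zeros_like(x)
    return (x - x.mean()) / std

def min_aligned_distance(sub, pattern):
    """Smallest Euclidean distance between a z-normalized subsequence and
    any z-normalized window of the same length inside a longer pattern."""
    length = len(sub)
    best = np.inf
    for j in range(len(pattern) - length + 1):
        d = np.linalg.norm(sub - znorm(pattern[j:j + length]))
        best = min(best, d)
    return best

def anomaly_scores(series, length, normal_patterns, weights):
    """Score every subsequence of the given length by its weighted
    distance to the (assumed, user-supplied) normal patterns."""
    series = np.asarray(series, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    n_sub = len(series) - length + 1
    scores = np.empty(n_sub)
    for i in range(n_sub):
        sub = znorm(series[i:i + length])
        dists = [min_aligned_distance(sub, p) for p in normal_patterns]
        scores[i] = float(np.dot(weights, dists))
    return scores

# Toy data: a sine wave with period 20 and a higher-frequency burst
# injected at positions 200..219 (the anomaly).
t = np.arange(400)
series = np.sin(2 * np.pi * t / 20)
series[200:220] = np.sin(2 * np.pi * t[200:220] / 5)

# Assumed normal model: one pattern covering two periods of the sine,
# so that every phase of the normal behavior has a matching alignment.
normal_patterns = [np.sin(2 * np.pi * np.arange(40) / 20)]
weights = [1.0]

scores = anomaly_scores(series, 20, normal_patterns, weights)
top = int(np.argmax(scores))  # start index of the most anomalous subsequence
```

In this toy setting, subsequences that lie entirely in the regular sine region match some alignment of the pattern almost exactly and receive scores near zero, while subsequences overlapping the injected burst receive larger scores. The only parameter tied to the anomaly is its length, in line with the abstract's claim.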

Change history

31 August 2021
A Correction to this paper has been published: https://doi.org/10.1007/s00778-021-00678-1

Notes

If the dimension that imposes the ordering of the sequence is time then we talk about time series. In the rest of this paper, we will use the terms sequence, data series, and time series interchangeably.
http://www.safran-group.com/.
A preliminary version of this paper and a corresponding demo paper have appeared elsewhere [10, 11].
The authors of these papers define the problem as kth-discord discovery.

References

http://data-acoustics.com/measurements/bearing-faults/bearing-4/ (2007)
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml (2015)
http://helios.mi.parisdescartes.fr/~themisp/norma/
Abboud, D., Elbadaoui, M., Smith, W., Randall, R.: Advanced bearing diagnostics: A comparative study of two powerful approaches. MSSP 114 (2019)
Abdul-Aziz, A., Woike, M.R., Oza, N.C., Matthews, B.L., Lekki, J.D.: Rotor health monitoring combining spin tests and data-driven anomaly detection methods. Struct. Health Monit. (2012)
Ahmad, S., Lavin, A., Purdy, S., Agha, Z.: Unsupervised real-time anomaly detection for streaming data. Neurocomputing (2017)
Antoni, J., Borghesani, P.: A statistical methodology for the design of condition indicators. Mech. Syst. Signal Process. 290–327 (2019)
Bagnall, A.J., Cole, R.L., Palpanas, T., Zoumpatianos, K.: Data series management (dagstuhl seminar 19282). Dagstuhl Rep. 9(7), 24–39 (2019)
Barnett, V., Lewis, T.: Outliers in Statistical Data. Wiley, New York (1994)
Boniol, P., Linardi, M., Roncallo, F., Palpanas, T.: Automated Anomaly Detection in Large Sequences. In: ICDE pp. 1834–1837 (2020)
Boniol, P., Linardi, M., Roncallo, F., Palpanas, T.: SAD: an unsupervised system for subsequence anomaly detection. In: 36th IEEE International Conference on Data Engineering, ICDE, pp. 1778–1781. IEEE (2020)
Boniol, P., Palpanas, T.: Series2graph: graph-based subsequence anomaly detection for time series. Proc. VLDB Endow. 13(11), 1821–1834 (2020)
Boniol, P., Palpanas, T., Meftah, M., Remy, E.: Graphan: graph-based subsequence anomaly detection. Proc. VLDB Endow. 13(12), 2941–2944 (2020)
Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: Identifying density-based local outliers. In: SIGMOD (2000)
Bryant, P.G.: On the minimum description length (MDL) principle for hierarchical classifications. In: Data Science, Classification, and Related Methods (1998)
Bu, Y., Chen, L., Fu, A.W.C., Liu, D.: Efficient anomaly monitoring over moving object trajectory streams. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pp. 159–168. Association for Computing Machinery, New York, NY, USA (2009). https://doi.org/10.1145/1557019.1557043
Bu, Y., Leung, O.T., Fu, A.W., Keogh, E.J., Pei, J., Meshkin, S.: WAT: finding top-k discords in time series database. In: SIAM (2007)
Chiu, B.Y., Keogh, E.J., Lonardi, S.: Probabilistic discovery of time series motifs. In: KDD (2003)
Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: The lernaean hydra of data series similarity search: an experimental evaluation of the state of the art. PVLDB 2, 112–127 (2018)
Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: Return of the lernaean hydra: experimental evaluation of data series approximate similarity search. PVLDB 13, 402–419 (2019)
Fu, A.W., Leung, O.T., Keogh, E.J., Lin, J.: Finding time series discords based on haar transform. In: ADMA pp. 31–41 (2006)
Gharghabi, S., Yeh, C.M., Ding, Y., Ding, W., Hibbing, P., LaMunion, S., Kaplan, A., Crouter, S.E., Keogh, E.J.: Domain agnostic online semantic segmentation for multi-dimensional time series. Data Min. Knowl. Discov. 33(1), 96–130 (2019)
Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000). Circulation Electronic Pages: http://circ.ahajournals.org/content/101/23/e215.full; PMID:1085218; https://doi.org/10.1161/01.CIR.101.23.e215
Grabocka, J., Schilling, N., Schmidt-Thieme, L.: Latent time-series motifs. TKDD 11(1), 6:1-6:20 (2016)
Hadjem, M., Naït-Abdesselam, F., Khokhar, A.A.: ST-segment and T-wave anomalies prediction in an ECG data using RUSBoost. In: Healthcom (2016)
Keogh, E., Lin, J.: Clustering of time-series subsequences is meaningless: implications for previous and future research. KAIS 8(2) (2004)
Keogh, E., Lonardi, S., Ratanamahatana, C., Wei, L., Lee, S.H., Handley, J.: Compression-based data mining of sequential data. DMKD 14, 99–129 (2007)
Keogh, E.J., Lin, J., Fu, A.W.: HOT SAX: efficiently finding the most unusual time series subsequence. In: ICDM (2005)
Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: sortable summarizations for scalable indexes over static and streaming data series. VLDBJ 28(6) (2019)
Lee, J., Han, J., Li, X.: Trajectory outlier detection: a partition-and-detect framework. In: 2008 IEEE 24th International Conference on Data Engineering, pp. 140–149 (2008)
Lee, T., Gottschlich, J., Tatbul, N., Metcalf, E., Zdonik, S.: Greenhouse: a zero-positive machine learning system for time-series anomaly detection. CoRR arXiv:1801.03168 (2018). URL http://arxiv.org/abs/1801.03168
Li, X., Lin, J.: Linear time motif discovery in time series. In: Proceedings of the 2019 SIAM International Conference on Data Mining, pp. 136–144. SIAM (2019)
Linardi, M., Palpanas, T.: Scalable, variable-length similarity search in data series: the ulisse approach. PVLDB 11, 2236–2248 (2019)
Linardi, M., Zhu, Y., Palpanas, T., Keogh, E.: Matrix profile x: Valmod - scalable discovery of variable-length motifs in data series. In: SIGMOD (2018)
Linardi, M., Zhu, Y., Palpanas, T., Keogh, E.J.: Matrix Profile Goes MAD: variable-length motif and discord discovery in data series. In: DAMI (2020)
Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: ICDM (2008)
Liu, Y., Chen, X., Wang, F.: Efficient detection of discords for time series stream. In: Advances in Data and Web Management (2009)
Luo, W., Gallagher, M.: Faster and parameter-free discord search in quasi-periodic time series. In: Advances in Knowledge Discovery and Data Mining (2011)
Malhotra, P., Vig, L., Shroff, G., Agarwal, P.: Long short term memory networks for anomaly detection in time series. In: ESANN (2015)
Moody, G.B., Mark, R.G.: The impact of the MIT-BIH arrhythmia database. IEEE Eng. Med. Biol. Mag. 20, 45–50 (2001)
Mueen, A., Keogh, E.J., Zhu, Q., Cash, S., Westover, M.B.: Exact discovery of time series motifs. In: SDM (2009)
Palpanas, T.: Data series management: the road to big sequence analytics. SIGMOD Rec. 44(2), 47–52 (2015)
Palpanas, T.: Evolution of a Data Series Index. In: CCIS, pp. 68–83 (2020)
Palpanas, T., Beckmann, V.: Report on the first and second interdisciplinary time series analysis workshop (ITISA). SIGREC 48(3) (2019)
Paparrizos, J., Gravano, L.: K-shape: efficient and accurate clustering of time series. SIGMOD Rec. 45(1), 69–76 (2016). https://doi.org/10.1145/2949741.2949758
Paul Boniol (advisor: Themis Palpanas): Unsupervised subsequence anomaly detection in large sequences. In: Proceedings of the VLDB 2020 PhD Workshop colocated with the 46th International Conference on Very Large Databases (VLDB 2020), CEUR Workshop Proceedings, vol. 2652 (2020)
Peng, B., Palpanas, T., Fatourou, P.: Messi: In-memory data series indexing. In: ICDE (2020)
Peng, B., Palpanas, T., Fatourou, P.: Paris+: data series indexing on multi-core architectures. In: TKDE (2020)
Rakthanmanon, T., Keogh, E.J., Lonardi, S., Evans, S.: Time series epenthesis: clustering time series streams requires ignoring some data. In: 2011 IEEE 11th International Conference on Data Mining, pp. 547–556 (2011)
Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978)
Safran: Personal communication with Dr. Dohy Hong (2018)
Senin, P., Lin, J., Wang, X., Oates, T., Gandhi, S., Boedihardjo, A.P., Chen, C., Frankenstein, S.: Time series anomaly discovery with grammar-based compression. In: EDBT (2015)
Senin, P., Lin, J., Wang, X., Oates, T., Gandhi, S., Boedihardjo, A.P., Chen, C., Frankenstein, S.: Grammarviz 3.0: Interactive discovery of variable-length time series patterns. TKDD 12, 1–28 (2018)
Shieh, J., Keogh, E.: iSAX: disk-aware mining and indexing of massive time series datasets. DMKD 19, 24–27 (2009)
Subramaniam, S., Palpanas, T., Papadopoulos, D., Kalogeraki, V., Gunopulos, D.: Online outlier detection in sensor data using non-parametric models. In: VLDB (2006)
Wang, J., Balasubramanian, A., de la Vega, L.M., Green, J., Samal, A., Prabhakaran, B.: Word recognition from continuous articulatory movement time-series data using symbolic representations. In: SLPAT (2013)
Wang, X., Lin, J., Patel, N., Braun, M.: A self-learning and online algorithm for time series anomaly detection, with application in CPU manufacturing. In: CIKM (2016)
Whitney, C., Gottlieb, D., Redline, S., Norman, R., Dodge, R., Shahar, E., Surovec, S., Nieto, F.: Reliability of scoring respiratory disturbance indices and sleep staging. Sleep 21, 749–757 (1998)
Wilcoxon, F.: Individual comparisons by ranking methods. Biom. Bull. 1(6), 80–83 (1945). http://www.jstor.org/stable/3001968
Wu, Q., Qi, X., Fuller, E., Zhang, C.Q.: Follow the leader: A centrality guided clustering and its application to social network analysis. Sci. World J. (2013)
Yankov, D., Keogh, E., Rebbapragada, U.: Disk aware discord discovery: finding unusual time series in terabyte sized datasets. In: ICDM (2007)
Yankov, D., Keogh, E., Rebbapragada, U.: Disk aware discord discovery: finding unusual time series in terabyte sized datasets. KAIS 17(2) (2008)
Yankov, D., Keogh, E.J., Medina, J., Chiu, B.Y., Zordan, V.B.: Detecting time series motifs under uniform scaling. In: KDD (2007)
Yeh, C., Zhu, Y., Ulanova, L., Begum, N., Ding, Y., Dau, H., Silva, D., Mueen, A., Keogh, E.: Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In: ICDM (2016)
Yu, Y., Cao, L., Rundensteiner, E.A., Wang, Q.: Outlier detection over massive-scale trajectory streams. ACM Trans. Database Syst. (TODS) 42, 1–33 (2017)
Zhu, Y., Zimmerman, Z., Senobari, N.S., Yeh, C.M., Funning, G., Mueen, A., Brisk, P., Keogh, E.: Matrix profile ii: Exploiting a novel algorithm and gpus to break the one hundred million barrier for time series motifs and joins. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 739–748 (2016). https://doi.org/10.1109/ICDM.2016.0085

Author information

Authors and Affiliations

EDF R&D, Paris, France
Paul Boniol, Mohammed Meftah & Emmanuel Remy
Université de Paris, Paris, France
Michele Linardi, Federico Roncallo & Themis Palpanas

Corresponding author

Correspondence to Paul Boniol.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Boniol, P., Linardi, M., Roncallo, F. et al. Unsupervised and scalable subsequence anomaly detection in large data series. The VLDB Journal 30, 909–931 (2021). https://doi.org/10.1007/s00778-021-00655-8

Received: 23 March 2020
Revised: 12 October 2020
Accepted: 16 January 2021
Published: 03 March 2021
Issue Date: November 2021
DOI: https://doi.org/10.1007/s00778-021-00655-8