Abstract
Data series similarity search is an important operation at the core of several analysis tasks and applications related to data series collections. Although data series indexes enable fast similarity search, all existing indexes can only answer queries of a single length (fixed at index construction time), which is a severe limitation. In this work, we propose ULISSE, the first data series index structure designed for answering similarity search queries of variable length (within some range). Our contribution is twofold. First, we introduce a novel representation technique, which effectively and succinctly summarizes multiple sequences of different lengths. Second, based on the proposed index, we describe efficient algorithms for approximate and exact similarity search, combining disk-based index visits and in-memory sequential scans. Our approach supports both non-Z-normalized and Z-normalized sequences and can be used, without changes, with both Euclidean distance and dynamic time warping, for answering both k-NN and \(\epsilon\)-range queries. We experimentally evaluate our approach using several synthetic and real datasets. The results show that ULISSE is several times, and up to orders of magnitude, more efficient in terms of both space and time cost when compared to competing approaches.
Notes
If the dimension that imposes the ordering of the sequence is time, then we talk about time series. However, a series can also be defined over other measures (e.g., angle in radial profiles in astronomy, mass in mass spectroscopy in physics, etc.). We use the terms data series, time series and sequence interchangeably.
Z-normalization transforms a series so that it has a mean value of zero, and a standard deviation of one. This allows similarity search to be effective, irrespective of shifting (i.e., offset translation) and scaling [51].
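As an illustration of the note above, the following is a minimal sketch of Z-normalization in plain Python. The function names `z_normalize` and `euclidean` and the sample values are hypothetical and chosen only for this example; they are not taken from the ULISSE implementation. The sketch assumes the population standard deviation and maps a constant series to all zeros, which is one common convention among several.

```python
import math


def z_normalize(series):
    """Return a copy of `series` with mean 0 and standard deviation 1.

    Assumption: population standard deviation (divide by n).
    Assumption: a constant series (std == 0) is mapped to all zeros.
    """
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in series]


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical data: the second series is the first one scaled and shifted.
base = [1.0, 2.0, 3.0, 2.0, 1.0]
shifted_scaled = [10.0 + 5.0 * x for x in base]

raw_distance = euclidean(base, shifted_scaled)
normalized_distance = euclidean(z_normalize(base), z_normalize(shifted_scaled))
```

Here `raw_distance` is large because the two series differ in offset and amplitude, while `normalized_distance` is zero up to floating-point rounding, since the two series have the same shape.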
References
Kashino, K., Smith, G., Murase, H.: Time-series active search for quick retrieval of audio and video. In: ICASSP, (1999)
Raza, U., Camerra, A., Murphy, A.L., Palpanas, T., Picco, G.P.: Practical data prediction for real-world wireless sensor networks. IEEE Trans. Knowl. Data Eng. 27(8), 2231–2244 (2015)
Shasha, D.: Tuning time series queries in finance: Case studies and recommendations. IEEE Data Eng. Bull. 22(2), 40–46 (1999)
Huijse, P., Estévez, P.A., Protopapas, P., Principe, J.C., Zegers, P.: Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE Comput. Intell. Mag. 9(3), 27–39 (2014)
Palpanas, T.: Data series management: the road to big sequence analytics. SIGMOD Rec. 44(2), 47–52 (2015)
ESA. SENTINEL-2 mission. https://sentinel.esa.int/web/sentinel/missions/sentinel-2
Zoumpatianos, K., Palpanas, T.: Data series management: Fulfilling the need for big sequence analytics. In: ICDE, (2018)
Palpanas, T., Beckmann, V.: Report on the first and second interdisciplinary time series analysis workshop (ITISA). SIGMOD Rec. 48(3), 36–40 (2019)
Bagnall, A.J., Cole, R.L., Palpanas, T., Zoumpatianos, K.: Data series management. Dagstuhl Reports 9(7), 47–52 (2019)
Niennattrakul, V., Ratanamahatana, C.A.: On clustering multimedia time series data using k-means and dynamic time warping. In: MUE, (2007)
Lines, J., Bagnall, A.: Time series classification with ensembles of elastic distance measures. DAMI 29(3), 565–592 (2015)
Senin, P., Lin, J., Wang, X., Oates, T., Gandhi, S., Boedihardjo, A.P., Chen, C., Frankenstein, S.: Time series anomaly discovery with grammar-based compression. In: EDBT, (2015)
Boniol, P., Linardi, M., Roncallo, F., Palpanas, T.: Automated Anomaly Detection in Large Sequences. In: ICDE, (2020)
Boniol, P., Palpanas, T.: Series2Graph: Graph-based Subsequence Anomaly Detection for Time Series. PVLDB, (2020)
Zoumpatianos, K., Idreos, S., Palpanas, T.: Indexing for interactive exploration of big data series. In: SIGMOD, (2014)
Palpanas, T.: Big sequence management: a glimpse of the past, the present, and the future. In: SOFSEM, (2016)
Palpanas, T.: The parallel and distributed future of data series mining. In: HPCS, (2017)
Gogolou, A., Tsandilas, T., Palpanas, T., Bezerianos, A.: Progressive similarity search on time series data. In: BigVis, in Conjunction with EDBT/ICDT, (2019)
Gogolou, A., Tsandilas, T., Echihabi, K., Bezerianos, A., Palpanas, T.: Data series progressive similarity search with probabilistic quality guarantees. In: SIGMOD (2020)
Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: The lernaean hydra of data series similarity search: an experimental evaluation of the state of the art. PVLDB 12(2), 112–127 (2018)
Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: Return of the lernaean hydra: experimental evaluation of data series approximate similarity search. PVLDB 13(3), 403–420 (2019)
Palpanas, T.: Evolution of a Data Series Index—The iSAX family of data series indexes. In: CCIS, (2020)
Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. In: SIGMOD, (1994)
Rafiei, D., Mendelzon, A.: Efficient retrieval of similar time sequences using dft. In: ICDE, (1998)
Keogh, E.J., Palpanas, T., Zordan, V.B., Gunopulos, D., Cardle, M.: Indexing large human-motion databases. In: VLDB, (2004)
Assent, I., Krieger, R., Afschari, F., Seidl, T.: The ts-tree: Efficient time series search and retrieval. In: EDBT, (2008)
Shieh, J., Keogh, E.J.: isax: indexing and mining terabyte sized time series. In: KDD, pp. 623–631, (2008)
Kadiyala, S., Shiri, N.: A compact multi-resolution index for variable length queries in time series databases. KAIS 15(2), 131–147 (2008)
Wang, Y., Wang, P., Pei, J., Wang, W., Huang, S.: A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB 6(10), 793–804 (2013)
Camerra, A., Shieh, J., Palpanas, T., Rakthanmanon, T., Keogh, E.J.: Beyond one billion time series: indexing and mining very large time series collections with isax2+. KAIS, (2014)
Dallachiesa, M., Palpanas, T., Ilyas, I.F.: Top-k nearest neighbor search in uncertain data series. PVLDB 8(1), 13–24 (2014)
Zoumpatianos, K., Idreos, S., Palpanas, T.: RINSE: interactive data series exploration with ADS+. PVLDB 8(12), 1912–1915 (2015)
Zoumpatianos, K., Idreos, S., Palpanas, T.: ADS: the adaptive data series index. VLDB J. 25(6), 843–866 (2016)
Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T.: Dpisax: Massively distributed partitioned isax. In: ICDM, (2017)
Yagoubi, D.-E., Akbarinia, R., Masseglia, F., Palpanas, T.: Massively distributed time series indexing and querying. TKDE 32(1), 108–120 (2020)
Peng, B., Fatourou, P., Palpanas, T.: Paris: The next destination for fast data series indexing and query answering. In: IEEE Big Data, (2018)
Peng, B., Palpanas, T., Fatourou, P.: Paris+: Data series indexing on multi-core architectures. In: TKDE, (2020)
Peng, B., Palpanas, T., Fatourou, P.: Messi: In-memory data series indexing. In: ICDE, (2020)
Peng, B. (supervised by Panagiota Fatourou and Themis Palpanas): Data Series Indexing Gone Parallel. In: ICDE PhD Workshop, (2020)
Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: a scalable bottom-up approach for building data series indexes. PVLDB 11(6), 677–690 (2018)
Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut palm: Static and streaming data series exploration now in your palm. In: SIGMOD, (2019)
Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: sortable summarizations for scalable indexes over static and streaming data series. VLDBJ 28(6), 847–869 (2019)
Kahveci, T., Singh, A.: Variable length queries for time series data. In: ICDE, (2001)
Rakthanmanon, T., Campana, B.J.L., Mueen, A., Batista, G. E. A.P.A., Westover, M.B., Zhu, Q., Zakaria, J., Keogh, E.J.: Searching and mining trillions of time series subsequences under dynamic time warping. In: SIGKDD, (2012)
Linardi, M., Zhu, Y., Palpanas, T., Keogh, E. J.: Matrix profile X: VALMOD—scalable discovery of variable-length motifs in data series. In: SIGMOD Conference (2018)
Linardi, M., Zhu, Y., Palpanas, T., Keogh, E. J.: VALMOD: A suite for easy and exact detection of variable length motifs in data series. In: SIGMOD Conference (2018)
Linardi, M., Zhu, Y., Palpanas, T., Keogh, E.J.: Matrix Profile Goes MAD: Variable-length motif and discord discovery in data series. In: DAMI, (2020)
Linardi, M. (supervised by Themis Palpanas): Effective and Efficient Variable-Length Data Series Analytics. In: VLDB PhD Workshop, (2019)
A.G.H. of Operational Intelligence Department Airbus. Personal communication, (2017)
Rosa, A.C., Parrino, L., Terzano, M.G.: Automatic detection of cyclic alternating pattern (cap) sequences in sleep: preliminary results. Clin. Neurophysiol. 110(4), 585–592 (1999)
Keogh, E.J., Kasetty, S.: On the need for time series data mining benchmarks: A survey and empirical demonstration. DAMI 7(4), 349–371 (2003)
Camerra, A., Palpanas, T., Shieh, J., Keogh, E.J.: isax 2.0: Indexing and mining one billion time series. In: ICDM (2010)
Linardi, M., Palpanas, T.: Scalable, variable-length similarity search in data series: The ULISSE approach. PVLDB 11(13), 2236–2248 (2018)
Linardi, M., Palpanas, T.: ULISSE: ULtra compact index for variable-length similarity SEarch in data series. In: ICDE (2018)
Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. KAIS 3(3), 263–286 (2000)
Loh, W., Kim, S., Whang, K.: A subsequence matching algorithm that supports normalization transform in time-series databases. Data Min. Knowl. Discov. 9(1), 5–28 (2004)
Han, W., Lee, J., Moon, Y., Jiang, H.: Ranked subsequence matching in time-series databases. In: VLDB, (2007)
Wu, J., Wang, P., Pan, N., Wang, C., Wang, W., Wang, J.: Kv-match: A subsequence matching approach supporting normalization and time warping. In: ICDE, (2019)
Mueen, A., Hamooni, H., Estrada, T.: Time series join on subsequence correlation. In: ICDM, (2014)
Kruskal, J., Liberman, M.: The symmetric time-warping problem: From continuous to discrete. In: Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, (1983)
Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 26(1), 46–49 (1978)
Itakura, F.: Minimum prediction residual principle applied to speech recognition. IEEE Trans. Acoust. Speech Signal Process. 23(1), 67–72 (1975)
Lin, J., Keogh, E., Wei, L., Lonardi, S.: Experiencing sax: a novel symbolic representation of time series. DAMI 15(2), 107–144 (2007)
Keogh, E.J., Ratanamahatana, C.A.: Exact indexing of dynamic time warping. Knowl. Inf. Syst. 7(3), 358–386 (2005)
Zoumpatianos, K., Lou, Y., Palpanas, T., Gehrke, J.: Query workloads for data series indexes. In: SIGKDD, (2015)
Zoumpatianos, K., Lou, Y., Ileana, I., Palpanas, T., Gehrke, J.: Generating data series query workloads. VLDB J. 27(6), 823–846 (2018)
Lichman, M.: UCI machine learning repository, (2013)
Terzano, M.G., Parrino, L., Sherieri, A., Chervin, R., Chokroverty, S., Guilleminault, C., Hirshkowitz, M., Mahowald, M., Moldofsky, H., Rosa, A., Thomas, R., Walters, A.: Atlas, rules, and recording techniques for the scoring of cyclic alternating pattern (cap) in human sleep. Sleep Med. 2(6), 537–553 (2001)
Healey, J.A., Picard, R.W.: Detecting stress during real-world driving tasks using physiological sensors. ITS 6(2), 156–166 (2005)
Soldi, S., Beckmann, V., Baumgartner, W.H., Ponti, G., Shrader, C.R., Lubinski, P., Krimm, H.A., Mattana, F., Tueller, J.: Long-term variability of agn at hard x-rays. Astronomy Astrophys. 563, A57 (2014)
IRIS. Seismic Data Access. http://ds.iris.edu/data/access, (2016)
Bagnall, A., Lines, J., Bostrom, A., Large, J., Keogh, E.J.: The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. DAMI, (2017)
Wang, X., Mueen, A., Ding, H., Trajcevski, G., Scheuermann, P., Keogh, E.J.: Experimental comparison of representation methods and distance measures for time series data. DAMI 26(2), 275–309 (2013)
Cite this article
Linardi, M., Palpanas, T. Scalable data series subsequence matching with ULISSE. The VLDB Journal 29, 1449–1474 (2020). https://doi.org/10.1007/s00778-020-00619-4
DOI: https://doi.org/10.1007/s00778-020-00619-4