Fast data series indexing for in-memory data

Peng, Botao; Fatourou, Panagiota; Palpanas, Themis

doi:10.1007/s00778-021-00677-2

Regular Paper
Published: 18 June 2021

The VLDB Journal, Volume 30, pages 1041–1067 (2021)

Abstract

Data series similarity search is a core operation for several data series analysis applications across many different domains. However, the state-of-the-art techniques fail to deliver the time performance required for interactive exploration or analysis of large data series collections. In this work, we propose MESSI, the first data series index designed for in-memory operation on modern hardware. Our index takes advantage of the modern hardware parallelization opportunities (i.e., SIMD instructions, multi-socket and multi-core architectures), in order to accelerate both index construction and similarity search processing times. Moreover, it benefits from a careful design in the setup and coordination of the parallel workers and data structures, so that it maximizes its performance for in-memory operations. MESSI supports similarity search using both the Euclidean and dynamic time warping (DTW) distances. Our experiments with synthetic and real datasets demonstrate that overall MESSI is up to 4x faster at index construction and up to 11x faster at query answering than the state-of-the-art parallel approach. MESSI is the first to answer exact similarity search queries on 100GB datasets in ~50 ms (30–75 ms across diverse datasets), which enables real-time, interactive data exploration on very large data series collections.
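
To make the two distance measures named in the abstract concrete, the following is a minimal Python sketch of exact nearest-neighbor search under the Euclidean and DTW distances. It is not MESSI: it uses a plain linear scan with no index, no SIMD and no parallelism, and the function names (`euclidean`, `dtw`, `nearest`) and the `window` parameter (a Sakoe-Chiba warping band) are assumptions introduced here for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b, window):
    """DTW distance restricted to a band of half-width `window` (assumed parameter)."""
    n, m = len(a), len(b)
    w = max(window, abs(n - m))  # the band must at least cover the length difference
    inf = float("inf")
    # cost[i][j] = best cumulative squared cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def nearest(query, collection, dist):
    """Exact 1-NN by linear scan; returns (position in collection, distance)."""
    best = min(range(len(collection)), key=lambda i: dist(query, collection[i]))
    return best, dist(query, collection[best])
```

For example, `nearest(q, data, euclidean)` scans every series in `data`, while `nearest(q, data, lambda x, y: dtw(x, y, 1))` does the same under DTW with a band of one point. An index exists precisely to avoid computing most of these distances.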

Notes

A data series, or data sequence, is an ordered sequence of data points. If the ordering dimension is time, we speak of a time series, though series can be ordered over other measures (e.g., angle in astronomical radial profiles, frequency in infrared spectroscopy, mass in mass spectroscopy, position in genome sequences).
http://www.airbus.com/.
A preliminary version of this work has appeared elsewhere [63].
A preliminary version of this paper has appeared elsewhere [63].
We also tried an alternative design, where buffers were not split, so many threads could try to update each element of a buffer concurrently. Therefore, each buffer had to be protected by a lock. This design resulted in worse performance due to the contention in accessing the iSAX buffers.
Parallelizing the processing inside each one of the index root subtrees would require a lot of synchronization due to node splitting. When a node is split, two new leaf nodes are created and the data of the original leaf are moved to the new leaves.
We note that other lower bounds for DTW can be used as well, such as LB_Improved [45]. Even though LB_Improved can produce tighter bounds, in our experiments it also resulted in higher query answering times due to the additional computations it involves.
In such a case, indexing and similarity search would not be useful anyway.
MESSI can be adapted to support subsequence matching as follows: given a long series (in which we need to identify the most similar subsequence to the query), we extract subsequences from the long series by sliding a window (of the same length as the query) over the entire length of the series, and then index all these subsequences.
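
The sliding-window adaptation described in the last note can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: the helper names `sliding_subsequences` and `best_subsequence` are hypothetical, and a brute-force Euclidean scan stands in for the step where the extracted subsequences would be inserted into, and queried through, the index.

```python
import math


def sliding_subsequences(series, length):
    """Return (offset, subsequence) pairs for every window of `length` points."""
    if length <= 0 or length > len(series):
        return []
    return [(i, series[i:i + length]) for i in range(len(series) - length + 1)]


def best_subsequence(series, query):
    """Find the window of `series` closest to `query` under Euclidean distance.

    A linear scan replaces the index lookup here (assumption for illustration).
    Returns (offset, distance); offset is None if the series is shorter than the query.
    """
    best_offset, best_sq = None, float("inf")
    for offset, sub in sliding_subsequences(series, len(query)):
        sq = sum((x - y) ** 2 for x, y in zip(sub, query))
        if sq < best_sq:
            best_offset, best_sq = offset, sq
    return best_offset, math.sqrt(best_sq)
```

A series of n points and a query of length m yield n - m + 1 overlapping subsequences, so the number of indexed objects grows roughly linearly with the length of the long series.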

References

Adhd-200. http://fcon_1000.projects.nitrc.org/indi/adhd200/ (2017)
Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: FODO (1993)
Ailamaki, A.: Databases and hardware: The beginning and sequel of a beautiful friendship. VLDB (2015)
Alvarez, V., Schuhknecht, F.M., Dittrich, J., Richter, S.: Main memory adaptive indexing for multi-core systems. In: DaMoN (2014)
Bagnall, A.J., Cole, R.L., Palpanas, T., Zoumpatianos, K.: Data series management (Dagstuhl seminar 19282). Dagstuhl Reports 9(7) (2019)
Bagnall, A.J., Lines, J., Bostrom, A., Large, J., Keogh, E.J.: The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Min. Knowl. Discov. 31(3), 606–660 (2017). https://doi.org/10.1007/s10618-016-0483-9
Binna, R., Zangerle, E., Pichl, M., Specht, G., Leis, V.: Hot: A height optimized trie index for main-memory database systems. In: SIGMOD. ACM (2018)
Blanas, S.: Query processing for datacenter-scale computers. In: CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings (2017)
Boniol, P., Linardi, M., Roncallo, F., Palpanas, T., Meftah, M., Remy, E.: Unsupervised and scalable subsequence anomaly detection in large data series. In: VLDBJ (2021)
Boniol, P., Linardi, M., Roncallo, F., Palpanas, T.: Automated anomaly detection in large sequences. In: ICDE (2020)
Boniol, P., Palpanas, T.: Series2Graph: graph-based subsequence anomaly detection for time series. In: PVLDB (2020)
Boniol, P., Paparrizos, J., Palpanas, T., Franklin, M.J.: SAND in action: subsequence anomaly detection for streams. In: PVLDB (2021)
Boniol, P., Paparrizos, J., Palpanas, T., Franklin, M.J.: SAND: streaming subsequence anomaly detection. In: PVLDB (2021)
Camerra, A., Shieh, J., Palpanas, T., Rakthanmanon, T., Keogh, E.: Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. KAIS 39(1), 2014 (2014)
Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. CSUR (2009)
Chatzigeorgakidis, G., Skoutas, D., Patroumpas, K., Palpanas, T., Athanasiou, S., Skiadopoulos, S.: Local pair and bundle discovery over co-evolving time series. In: Proceedings of the 16th International Symposium on Spatial and Temporal Databases, SSTD (2019)
Chatzigeorgakidis, G., Skoutas, D., Patroumpas, K., Palpanas, T., Athanasiou, S., Skiadopoulos, S.: Local similarity search on geolocated time series using hybrid indexing. In: Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL (2019)
Chatzigeorgakidis, G., Skoutas, D., Patroumpas, K., Palpanas, T., Athanasiou, S., Skiadopoulos, S.: Twin subsequence search in time series. In: Proceedings of the 24th International Conference on Extending Database Technology, EDBT (2021)
Chou, J., Wu, K., et al.: Fastquery: A parallel indexing system for scientific data. In: CLUSTER, pp. 455–464. IEEE (2011)
Corporation, I.: Intel 64 and IA-32 architectures optimization reference manual (2016)
Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: Return of the Lernaean hydra: experimental evaluation of data series approximate similarity search. PVLDB (2019)
Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: The Lernaean hydra of data series similarity search: an experimental evaluation of the state of the art. PVLDB (2018)
Echihabi, K., Zoumpatianos, K., Palpanas, T.: Big sequence management: on scalability. In: Proceedings of the IEEE International Conference on Big Data, IEEE BigData (2020)
Echihabi, K., Zoumpatianos, K., Palpanas, T.: Big sequence management: Scaling up and out. In: Proceedings of the 24th International Conference on Extending Database Technology, EDBT (2021)
Fekete, J.D., Primet, R.: Progressive analytics: a computation paradigm for exploratory data analysis. CoRR (2016)
Feng, K., Wang, P., Wu, J., Wang, W.: L-match: a lightweight and effective subsequence matching approach. IEEE Access 8, 71572–71583 (2020)
Gepner, P., Kowalik, M.F.: Multi-core processors: new way to achieve high system performance. In: PAR ELEC (2006)
Gogolou, A., Tsandilas, T., Echihabi, K., Bezerianos, A., Palpanas, T.: Data series progressive similarity search with probabilistic quality guarantees. In: Maier, D., Pottinger, R., Doan, A., Tan, W., Alawini, A., Ngo, H.Q. (eds.) Proceedings of the 2020 International Conference on Management of Data, SIGMOD (2020)
Gogolou, A., Tsandilas, T., Palpanas, T., Bezerianos, A.: Progressive similarity search on time series data. In: EDBT (2019)
Gowanlock, M.G., Casanova, H.: Distance threshold similarity searches: efficient trajectory indexing on the GPU. IEEE Trans. Parallel Distrib. Syst. 27(9), 2016 (2016)
Grabocka, J., Schilling, N., Schmidt-Thieme, L.: Latent time-series motifs. TKDD 11(1), 6:1–6:20 (2016)
Guillaume, A.: Head of Operational Intelligence Department Airbus. Personal communication (2017)
Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc, Revised Reprint (2012)
http://helios.mi.parisdescartes.fr/~themisp/messi/ (2020)
Incorporated Research Institutions for Seismology—Seismic Data Access. http://ds.iris.edu/data/access/ (2016)
Kashyap, S., Karras, P.: Scalable knn search on vertically stored time series. In: SIGKDD, pp. 1334–1342 (2011)
Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. KAIS (2001)
Keogh, E.J., Pazzani, M.J.: An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In: KDD (1998)
Keogh, E., Ratanamahatana, C.A.: Exact indexing of dynamic time warping. Knowledge and information systems (2005)
Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut palm: Static and streaming data series exploration now in your palm. In: SIGMOD (2019)
Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: A scalable bottom-up approach for building data series indexes. PVLDB (2018)
Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: sortable summarizations for scalable indexes over static and streaming data series. VLDBJ 28(6), 2019 (2019)
Laviron, P., Dai, X., Huquet, B., Palpanas, T.: Electricity demand activation extraction: From known to unknown signatures, using similarity search. In: Proceedings of the ACM International Conference on Future Energy Systems, e-Energy (2021)
Leis, V., Kemper, A., Neumann, T.: The adaptive radix tree: Artful indexing for main-memory databases. In: ICDE (2013)
Lemire, D.: Faster retrieval with a two-pass dynamic-time-warping lower bound. Pattern Recognit. 42(9), 2169–2180 (2009)
Levchenko, O., Kolev, B., Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T., Shasha, D.E., Valduriez, P.: Bestneighbor: efficient evaluation of knn queries on large time series databases. Knowl. Inf. Syst. 63(2), 349–378 (2021)
Li, C., Yu, P.S., Castelli, V.: Hierarchyscan: a hierarchical similarity search algorithm for databases of long sequences. In: ICDE (1996)
Liao, T.W.: Clustering of time series data—a survey. Pattern Recognit. 38(11), 1857–1874 (2005)
Linardi, M., Palpanas, T.: Scalable, variable-length similarity search in data series: The ulisse approach. PVLDB (2019)
Linardi, M., Palpanas, T.: ULISSE: ULtra compact Index for Variable-Length Similarity SEarch in Data Series. In: ICDE (2018)
Linardi, M., Zhu, Y., Palpanas, T., Keogh, E.J.: Matrix Profile Goes MAD: Variable-Length Motif And Discord Discovery in Data Series. In: DAMI (2020)
Linardi, M., Palpanas, T.: Scalable data series subsequence matching with ULISSE. VLDB J. 29(6), 1449–1474 (2020)
Lomet, D.B., Nawab, F.: High performance temporal indexing on modern hardware. In: ICDE (2015)
Lomont, C.: Introduction to intel advanced vector extensions. Intel White Paper (2011)
Mueen, A., Keogh, E.J., Zhu, Q., Cash, S., Westover, M.B., Shamlo, N.B.: A disk-aware algorithm for time series motif discovery. DAMI (2011)
Mueen, A., Nath, S., Liu, J.: Fast approximate correlation for massive time-series data. In: SIGMOD (2010)
Palpanas, T., Beckmann, V.: Report on the first and second interdisciplinary time series analysis workshop (ITISA). SIGREC 48(3) (2019)
Palpanas, T.: Data series management: The road to big sequence analytics. SIGMOD Record (2015)
Palpanas, T.: Evolution of a Data Series Index. CCIS 1197 (2020)
Palpanas, T.: The parallel and distributed future of data series mining. In: HPCS (2017)
Pelkonen, T., Franklin, S., Cavallaro, P., Huang, Q., Meza, J., Teller, J., Veeraraghavan, K.: Gorilla: A fast, scalable, in-memory time series database. VLDB (2015)
Peng, B., Fatourou, P., Palpanas, T.: SING: Sequence Indexing Using GPUs. In: ICDE (2021)
Peng, B., Palpanas, T., Fatourou, P.: Messi: In-memory data series indexing. In: ICDE (2020)
Peng, B., Palpanas, T., Fatourou, P.: Paris: The next destination for fast data series indexing and query answering. IEEE BigData (2018)
Peng, B., Palpanas, T., Fatourou, P.: Paris+: Data series indexing on multi-core architectures. TKDE (2020)
Piatov, D., Helmer, S., Dignös, A., Gamper, J.: Interactive and space-efficient multi-dimensional time series subsequence matching. Inf. Syst. 82, 121–135 (2019)
Polychroniou, O., Raghavan, A., Ross, K.A.: Rethinking SIMD vectorization for in-memory databases. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31–June 4, 2015, pp. 1493–1508 (2015)
Polychroniou, O., Raghavan, A., Ross, K.A.: Rethinking simd vectorization for in-memory databases. In: SIGMOD. ACM (2015)
Polychroniou, O., Ross, K.A.: Vectorized bloom filters for advanced SIMD processors. In: Tenth International Workshop on Data Management on New Hardware, DaMoN 2014, Snowbird, UT, USA, June 23, 2014, pp. 6:1–6:6 (2014)
Rakthanmanon, T., Campana, B.J.L., Mueen, A., Batista, G.E.A.P.A., Westover, M.B., Zhu, Q., Zakaria, J., Keogh, E.J.: Searching and mining trillions of time series subsequences under dynamic time warping. In: SIGKDD (2012)
Rakthanmanon, T., Keogh, E.J., Lonardi, S., Evans, S.: Time series epenthesis: clustering time series streams requires ignoring some data. In: ICDM, pp. 547–556 (2011)
Rodrigues, P.P., Gama, J., Pedroso, J.: Hierarchical clustering of time-series data streams. TKDE (2008)
Shieh, J., Keogh, E.: iSAX: indexing and mining terabyte sized time series. In: SIGKDD (2008)
Shieh, J., Keogh, E.: iSAX: disk-aware mining and indexing of massive time series datasets. DMKD (2009)
Sloan digital sky survey. https://www.sdss3.org/dr10/data_access/volume.php (2017)
Southwest university adult lifespan dataset (sald). http://fcon_1000.projects.nitrc.org/indi/retro/sald.html (2018)
Tan, C.W., Webb, G.I., Petitjean, F.: Indexing and classifying gigabytes of time series under time warping. In: ICDM (2017)
Tang, B., Yiu, M.L., Li, Y., et al.: Exploit every cycle: Vectorized time series algorithms on modern commodity cpus. In: IMDM (2016)
Tatikonda, S., Parthasarathy, S.: An adaptive memory conscious approach for mining frequent trees: implications for multi-core architectures. In: SIGPLAN. ACM (2008)
Wang, Q., Palpanas, T.: Deep Learning Embeddings for Data Series Similarity Search. In: SIGKDD (2021)
Wang, Y., Wang, P., Pei, J., Wang, W., Huang, S.: A data-adaptive and dynamic segmentation index for whole matching on time series. VLDB (2013)
Wu, J., Wang, P., Pan, N., Wang, C., Wang, W., Wang, J.: Kv-match: A subsequence matching approach supporting normalization and time warping. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 866–877. IEEE (2019)
Xiao, L., Zheng, Y., Tang, W., Yao, G., Ruan, L.: Parallelizing dynamic time warping algorithm using prefix computations on GPU. In: HPCC_EUC. IEEE (2013)
Xie, Z., Cai, Q., Chen, G., Mao, R., Zhang, M.: A comprehensive performance evaluation of modern in-memory indices. In: ICDE (2018)
Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T.: Massively distributed time series indexing and querying. IEEE Trans. Knowl. Data Eng. 32(1), 108–120 (2020)
Yi, B.K., Faloutsos, C.: Fast time sequence indexing for arbitrary lp norms. In: VLDB. Citeseer (2000)
Zeuch, S., Freytag, J., Huber, F.: Adapting tree structures for processing with SIMD instructions. In: EDBT (2014)
Zhou, J., Ross, K.A.: Implementing database operations using simd instructions. In: SIGMOD (2002)
Zoumpatianos, K., Palpanas, T.: Data series management: fulfilling the need for big sequence analytics. In: ICDE (2018)
Zoumpatianos, K., Idreos, S., Palpanas, T.: Ads: the adaptive data series index. VLDB J. 25, 843–866 (2016)
Zoumpatianos, K., Lou, Y., Ileana, I., Palpanas, T., Gehrke, J.: Generating data series query workloads. VLDB J. 27(6), 823–846 (2018)

Acknowledgements

Work was supported by Investir l’Avenir, Univ. of Paris IDEX Emergence en Recherche ANR-18-IDEX-000, CSC, FMJH PGMO, EDF, Thales, HIPEAC 4 and partly performed when P. Fatourou visited LIPADE and B. Peng visited CARV, FORTH ICS.

Author information

Authors and Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Botao Peng
FORTH ICS, Crete, Greece
Panagiota Fatourou
LIPADE, Université de Paris & French University Institute (IUF), Paris, France
Themis Palpanas

Corresponding author

Correspondence to Botao Peng.

About this article

Cite this article

Peng, B., Fatourou, P. & Palpanas, T. Fast data series indexing for in-memory data. The VLDB Journal 30, 1041–1067 (2021). https://doi.org/10.1007/s00778-021-00677-2

Received: 16 June 2020
Revised: 03 February 2021
Accepted: 03 May 2021
Published: 18 June 2021
Issue Date: November 2021
DOI: https://doi.org/10.1007/s00778-021-00677-2
