Abstract
Similarity search in streaming time series is challenging because of strict requirements on processing and response: a system must keep pace with a high-speed time-series stream and return accurate results to the query system. These difficulties call for a framework, built on a platform specialized for streaming data, that researchers in time-series data mining can use to build similarity search systems for streaming time series. In this paper, we introduce such a framework based on Spark Streaming. We then propose a prototype system implementing the framework to demonstrate its feasibility for building similarity search systems that work efficiently and effectively in a streaming context. The prototype system takes advantage of SUCR-DTW to perform similarity search efficiently in a streaming environment under Dynamic Time Warping. Experimental results from the prototype system show that the Spark job of similarity search in streaming time series completes quickly and accurately. The subsequences of streaming time series that are similar to predefined queries are found in near real time, and they are identical to those found by a reference system performing the same search. Furthermore, the prototype system is highly scalable and works stably while processing time-series streams arriving at a high, steady rate. These results also underline the value of combining Spark Streaming and SUCR-DTW to address this challenging problem.
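To make the underlying task concrete, the following is a minimal sketch of similarity search over a stream under Dynamic Time Warping. It is not SUCR-DTW and does not use Spark Streaming: it is a plain sliding-window search with a banded DTW distance, without data normalization or lower-bound pruning. The function names `dtw_distance` and `search_stream`, the squared-difference point cost, and the `threshold` and `window` parameters are assumptions made for illustration only.

```python
import math
from collections import deque


def dtw_distance(a, b, window):
    """DTW distance (sum of squared differences) between sequences a and b,
    restricted to a Sakoe-Chiba band of half-width `window`."""
    n, m = len(a), len(b)
    # The band must be at least as wide as the length difference.
    window = max(window, abs(n - m))
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        lo = max(1, i - window)
        hi = min(m, i + window)
        for j in range(lo, hi + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[m]


def search_stream(stream, query, threshold, window):
    """Slide a buffer of len(query) points over the stream and report every
    subsequence whose DTW distance to the query is at most `threshold`.
    Returns a list of (start_index, distance) pairs."""
    buf = deque(maxlen=len(query))
    matches = []
    for t, x in enumerate(stream):
        buf.append(x)  # the oldest point is dropped automatically
        if len(buf) == len(query):
            d = dtw_distance(list(buf), query, window)
            if d <= threshold:
                matches.append((t - len(query) + 1, d))
    return matches


if __name__ == "__main__":
    query = [0, 1, 2, 1, 0]
    stream = [5, 5, 0, 1, 2, 1, 0, 5, 5]
    print(search_stream(stream, query, threshold=0.5, window=1))
```

This sketch recomputes the full DTW distance at every new data point, which is the cost that specialized streaming methods aim to reduce. In a Spark Streaming setting the stream would arrive in micro-batches rather than as a Python iterable.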
Data availability
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available because they are part of the results of the CS2020-19 project funded by SGU. All results generated by the project are managed by and belong to the funder.
Funding
This research is funded by Saigon University (SGU) under grant number CS2020-19.
Author information
Contributions
Bui Cong Giao wrote most of the paper, implemented the framework, and conducted the experiments. Phan Cong Vinh contributed to the framework design and proofread the paper.
Ethics declarations
Ethics approval
Not Applicable.
Conflict of interests
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Giao, B.C., Vinh, P.C. A Framework for Similarity Search in Streaming Time Series based on Spark Streaming. Mobile Netw Appl 27, 2084–2097 (2022). https://doi.org/10.1007/s11036-022-01988-6
DOI: https://doi.org/10.1007/s11036-022-01988-6