A new approach for optimal offline time-series segmentation with error bound guarantee
Introduction
A time series is a sequence of data points ordered in time, T = {t_1, t_2, ..., t_N}, where the t_i are individual observations and N is the number of observations in the time series.
Time series have been applied in many areas, such as medicine [2], economics [3], telecommunications [4] and online signature verification [5]. Due to the large amount of data used in most application areas, many methods have been proposed to reduce their size without losing relevant information [6].
In the context of this work, segmentation consists of dividing the time series into relevant points, cut points (CP), to reduce its dimensionality by means of a new representation space [7]. The problem of representation is a core issue in pattern recognition, and can dramatically impact the classification performance as well as the computational resources required to solve a particular problem, and the interpretability of the solutions found [8].
For this purpose, one of the most used techniques is called Piecewise Linear Approximation (PLA). PLA divides a time series into segments and uses a linear function to approximate each segment. There are two types of linear approximation [1]: linear interpolation, which uses the straight line connecting the two endpoints of a segment to represent its data points and generates continuous piecewise lines; and linear regression, which uses the regression line to approximate a segment and produces a set of disjoint lines.
Linear interpolation produces a continuous, smooth approximation and has low computational complexity, and the resulting representation is compact because the number of straight lines is much smaller than the number of points of the time series.
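The two approximation types can be contrasted with a minimal sketch (function names and data are ours, not from the paper): interpolation reproduces the endpoints exactly, while regression minimizes the squared error but generally does not pass through them.

```python
def interp_segment(t, y):
    # Straight line joining the two endpoints (linear interpolation);
    # consecutive segments share endpoints, so the piecewise line is continuous.
    slope = (y[-1] - y[0]) / (t[-1] - t[0])
    return [y[0] + slope * (ti - t[0]) for ti in t]

def regress_segment(t, y):
    # Least-squares regression line for the segment (linear regression);
    # consecutive segments generally do not join, producing disjoint lines.
    n = len(t)
    mt, my = sum(t) / n, sum(y) / n
    slope = (sum((ti - mt) * (yi - my) for ti, yi in zip(t, y))
             / sum((ti - mt) ** 2 for ti in t))
    intercept = my - slope * mt
    return [intercept + slope * ti for ti in t]

t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.0, 1.0, 0.5, 2.0, 1.5]
print([round(v, 3) for v in interp_segment(t, y)])   # [0.0, 0.375, 0.75, 1.125, 1.5]
print([round(v, 3) for v in regress_segment(t, y)])  # [0.2, 0.6, 1.0, 1.4, 1.8]
```

Note that the regression line has a smaller sum of squared residuals, at the cost of not matching the segment endpoints.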
The error of the approximation can be calculated using the L2-norm and the L∞-norm. The first is computed as the sum of the squared vertical differences between the straight lines obtained in the segmentation and the real data points; this value is called the integral squared error (ISE). The second is computed as the maximal vertical difference between all the straight lines obtained in the segmentation and the real data points; this value is called the maximum error (E_max). ISE and E_max are calculated as:

ISE = Σ_{i=1}^{N} e_i²,    E_max = max_{1 ≤ i ≤ N} e_i,

where e_i is the vertical distance from the i-th real data point to its corresponding straight line and N is the number of points of the time series.
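The two error measures above translate directly into code; a minimal sketch (names are ours):

```python
def segment_errors(y, approx):
    # e_i: vertical distance from each real data point to its straight line
    e = [abs(yi - ai) for yi, ai in zip(y, approx)]
    ise = sum(ei ** 2 for ei in e)  # L2-based integral squared error
    e_max = max(e)                  # L-infinity maximum error
    return ise, e_max

y = [0.0, 1.0, 0.0, 2.0]
approx = [0.0, 0.5, 0.5, 2.0]  # some piecewise-linear approximation of y
print(segment_errors(y, approx))  # (0.5, 0.5)
```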
Depending on whether the objective is to minimize the error or the amount of information [9], the problem can be addressed either by obtaining the best segmentation of a time series using a fixed number K of segments (the holistic approximation error, ISE or E_max, is minimized), or by obtaining the best segmentation of the time series such that the maximum error of any segment (E_max) or the accumulated error of all segments (ISE) is less than some prefixed threshold (the amount of information used to represent the time series is minimized).
Many works have been proposed to obtain the best segmentation using K segments, where the L2-norm is mostly used. However, there are two main drawbacks when the L2-norm is applied [10]: first, this criterion cannot generate error-guaranteed representations for streaming data, since the stream is naturally unbounded in size; second, the L2-norm is not able to control the approximation error on individual stream data items. To avoid these drawbacks, other methods based on the L∞-norm were proposed.
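The second drawback can be seen with a toy computation (illustrative numbers, ours): a single badly approximated stream item barely moves the accumulated L2 error, while the L∞ error flags it immediately.

```python
# A long, well-fitted stream with one badly approximated item.
n = 10_000
errors = [0.01] * n          # per-point vertical deviations e_i
errors[5_000] = 3.0          # a single stream item with a large error

ise = sum(e ** 2 for e in errors)   # accumulated L2 error
e_max = max(errors)                 # L-infinity error
print(ise / n)   # ~0.001: the averaged L2 criterion looks harmless
print(e_max)     # 3.0: the L-infinity bound exposes the bad item
```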
Taking into account how the cut points are obtained, segmentation methods can be categorized into three major groups of approaches [9]: Sliding Windows, where the segment increases until a preset error is exceeded and the new segment starts from the last point that does not exceed the preset error; Top-Down, where the time series is divided into smaller and smaller segments until a predetermined error is reached; and Bottom-Up, where, starting from the largest possible number of segments, these are merged until a predetermined error is exceeded.
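Of the three approaches, Sliding Windows is the simplest to sketch. The following is a minimal illustration under the E_max (L∞) criterion with linear interpolation; the function names and the exact growing policy are ours, not a reproduction of any surveyed method.

```python
def max_interp_error(y, i, j):
    # E_max of approximating points i..j by the straight line (chord)
    # joining (i, y[i]) and (j, y[j]) -- linear interpolation.
    slope = (y[j] - y[i]) / (j - i)
    return max(abs(y[k] - (y[i] + slope * (k - i))) for k in range(i, j + 1))

def sliding_window(y, eps):
    # Grow the current segment point by point; when the chord's E_max
    # exceeds eps, close the segment at the last point that satisfied it.
    cuts, anchor, j = [0], 0, 1
    while j < len(y):
        if max_interp_error(y, anchor, j) > eps:
            cuts.append(j - 1)
            anchor = j - 1
        else:
            j += 1
    cuts.append(len(y) - 1)
    return cuts

print(sliding_window([0, 0, 0, 5, 5, 5], 0.5))  # [0, 2, 3, 5]
```

The result is error-bounded but generally suboptimal: the greedy policy may use more segments than necessary.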
Other techniques based on metaheuristics, such as genetic algorithms and particle swarm optimization, have also been proposed [11], [12].
Depending on whether the time series is fixed in size or grows dynamically, the methods are classified as offline or online, respectively. Offline methods segment the complete time series, while online methods obtain segments based on the data seen so far. Sarker [13] compared static with dynamic segmentation, depending on whether the number of segments is prefixed or not.
All the methods mentioned above are suboptimal. Due to their high computational complexity, optimal methods can hardly be used in real-time applications; however, they can be used to evaluate the performance of suboptimal methods. An optimal offline method (OSTS), based on the A* algorithm, the L2-norm and linear interpolation, was proposed in Carmona-Poyato et al. [14]. Its computational time was greatly reduced by using pruning algorithms. This method was used to obtain the performance of some suboptimal methods based on the L2-norm.
Xie et al. [10] proposed an optimal online method based on the L∞-norm and linear regression. A linear interpolation-based method was also proposed by Xie, but it is not guaranteed to be optimal.
In this work, we propose a new optimal offline method, called OSFS, based on the feasible space (FS) [1], that uses the L∞-norm and linear interpolation.
As mentioned before, Xie [10] showed the main drawbacks of the L2-norm versus the L∞-norm. Since our proposal is optimal with respect to the L∞-norm, it will be compared to the optimal method (OSTS) proposed in Carmona-Poyato et al. [14], which is based on the L2-norm.
The present paper is arranged as follows. Section 2 describes the methods that were compared with the proposed method and whose performances were evaluated. Section 3 explains the new proposal. The experiments and results are detailed in Section 4. Finally, the main conclusions are summarized in Section 5.
Section snippets
Related work
In this section, some suboptimal time series segmentation methods will be described and their performances will be compared using the OSFS method and the new performance measure.
The first group of methods is heuristic and the second is metaheuristic.
Proposed method (OSFS method)
In this work, a new optimal offline segmentation method called Optimal Segmentation based on Feasible Space (OSFS) is proposed. This method minimizes the number of segments with an error bound guarantee: given a maximum allowed error based on the L∞-norm, the number of cut points (or segments) that approximates the time series must be minimized. Since there can be several such solutions, the one that also minimizes the value of ISE (L2-norm) is obtained. The solution to our problem is
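The optimization target can be sketched as a dynamic program over candidate cut points: among all segmentations whose every segment satisfies the L∞ bound, pick the one with the fewest segments, breaking ties by accumulated ISE. The following is a naive brute-force sketch of that objective (our own code, not the authors' FS-pruned algorithm, which achieves the same result far more efficiently).

```python
def max_interp_error(y, i, j):
    # L-infinity error of the chord joining (i, y[i]) and (j, y[j])
    slope = (y[j] - y[i]) / (j - i)
    return max(abs(y[k] - (y[i] + slope * (k - i))) for k in range(i, j + 1))

def segment_ise(y, i, j):
    # ISE of the same chord over points i..j
    slope = (y[j] - y[i]) / (j - i)
    return sum((y[k] - (y[i] + slope * (k - i))) ** 2 for k in range(i, j + 1))

def optimal_segmentation(y, eps):
    # best[j]: (segment count, accumulated ISE, predecessor cut) of the
    # lexicographically best feasible segmentation of y[0..j].
    n = len(y)
    INF = float("inf")
    best = [(INF, INF, -1)] * n
    best[0] = (0, 0.0, -1)
    for j in range(1, n):
        for i in range(j):
            if best[i][0] == INF or max_interp_error(y, i, j) > eps:
                continue  # segment i..j violates the L-infinity bound
            cand = (best[i][0] + 1, best[i][1] + segment_ise(y, i, j), i)
            if cand[:2] < best[j][:2]:  # fewer segments, then smaller ISE
                best[j] = cand
    cuts, j = [], n - 1
    while j != -1:          # walk predecessors back to the first point
        cuts.append(j)
        j = best[j][2]
    return cuts[::-1]

print(optimal_segmentation([0, 0, 0, 5, 5, 5], 0.5))  # [0, 2, 3, 5]
```

This exhaustive formulation is quadratic in the number of candidate cuts (with a further linear factor per error evaluation); the point of the FS-based pruning in OSFS is to discard infeasible successors of each cut point without enumerating them all.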
Experiments and results
This section shows the time series considered to evaluate the different methods, the experimental setting and the results obtained. The OSFS method has been implemented in C++ and all the experiments were run using an Intel(R) Core(TM) i7-870 K CPU at 3.70 GHz with 64 GB of RAM. The code and the time series used are available at https://github.com/ma1capoa/OSFS_Method
Conclusions and future improvements
The conclusions of this work can be summarized as follows. The present work proposes an optimal time-series segmentation method with an error bound guarantee (L∞-norm). Taking into account that several optimal solutions are possible, the one that minimizes the RMSE value (L2-norm) is obtained. In order to reduce the computational time by pruning suboptimal solutions, the feasible space (FS) method proposed by Liu [1] is used to obtain the possible successors of a cut point with error bound guarantee. On
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work has been developed with the support of the Research Projects TIN2016-75279-P and TIN2017-85887-C2-1-P of Spain Ministry of Economy, Industry, and Competitiveness, and FEDER.
A. Carmona-Poyato received his title of Agronomic Engineering and Ph.D. degree from the University of Cordoba (Spain), in 1986 and 1989, respectively. Since 1990 he has been working with the Department of Computing and Numerical Analysis of the University of Cordoba as lecturer. His research is focused on image processing, 2-D object recognition and time series analysis.
References (26)
- et al., Syntactic recognition of ECG signals by attributed finite automata, Pattern Recognit. (1995)
- et al., A shape-based adaptive segmentation of time-series using particle swarm optimization, Inf. Syst. (2017)
- et al., Time-series clustering – A decade review, Inf. Syst. (2015)
- et al., Dissimilarity-based representations for one class classification on time series, Pattern Recognit. (2020)
- et al., A statistically-driven coral reef optimization algorithm for optimal size reduction of time series, Appl. Soft Comput. (2018)
- et al., A new approach for optimal time-series segmentation, Pattern Recognit. Lett. (2020)
- Optimal polygonal approximation of digitized curves using the sum of square deviations criterion, Pattern Recognit. (2002)
- et al., Optimal polygonal approximation of digital curves, Pattern Recognit. (1995)
- et al., Novel online methods for time series segmentation, IEEE Trans. Knowl. Data Eng. (2008)
- et al., Pattern discovery of fuzzy time series for financial prediction, IEEE Trans. Knowl. Data Eng. (2006)
- Hancock: a language for extracting signatures from data streams, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
- Segmenting time series: a survey and novel approach, Data Min. Time Ser. Databases
Cited by (5)
- Adaptive error bounded piecewise linear approximation for time-series representation, 2023, Engineering Applications of Artificial Intelligence
- Optimal online time-series segmentation, 2023, Knowledge and Information Systems
- Optimal Segmented Linear Regression for Financial Time Series Segmentation, 2021, IEEE International Conference on Data Mining Workshops, ICDMW
N.L. Fernández-García received the Bachelor degree in Mathematics from the Complutense University of Madrid (Spain) in 1988 and the Ph.D. in Computer Science from the Polytechnic University of Madrid (Spain) in 2002. Since 1990, he has been working with the Department of Computing and Numerical Analysis at Córdoba University, where he is currently an assistant professor. His research is focused on edge detection, 2-D object recognition, evaluation of computer vision algorithms and time series analysis.
F.J. Madrid-Cuevas received the Bachelor degree in Computer Science from Malaga University (Spain) and the Ph.D. degree from the Polytechnic University of Madrid (Spain), in 1995 and 2003 respectively. Since 1996 he has been working with the Department of Computing and Numerical Analysis of Cordoba University, where he is currently an assistant professor. His research is focused mainly on image segmentation, 2-D object recognition, evaluation of computer vision algorithms and time series analysis.
Antonio Manuel Durán Rosal received the B.S. degree in Computer Science in 2014 and the M.Sc. degree in Computer Science in 2016 from the University of Córdoba. He received the Ph.D. in Computer Science from the University of Madrid (Spain) in 2018. Since 2018, he has been working in the Department of Quantitative Methods, Loyola University of Andalucía, Spain, where he is currently an assistant professor. His current interests include a wide range of topics concerning machine learning, pattern recognition and time series analysis.