A new approach for optimal offline time-series segmentation with error bound guarantee

https://doi.org/10.1016/j.patcog.2021.107917

Highlights

  • An optimal offline time-series segmentation method with an error bound guarantee (the OSFS method) is proposed.

  • The OSFS method is based on finding the shortest path in a directed graph.

  • To reduce the computational time, the feasible space (FS) method proposed by Liu et al. [1] is used.

  • A new measure to evaluate the performance of heuristic and metaheuristic methods is also proposed.

  • The results demonstrate that the L∞-norm produces better results than the L2-norm.

Abstract

Piecewise Linear Approximation is one of the most commonly used strategies to represent time series effectively and approximately. This approximation divides the time series into non-overlapping segments and approximates each segment with a straight line. Many suboptimal methods have been proposed for this purpose. This paper proposes a new optimal approach, called OSFS, based on the feasible space (FS) of Liu et al. (2008) [1], that minimizes the number of segments of the approximation and guarantees the error bound using the L∞-norm. In addition, a new performance measure, combined with the OSFS method, has been used to evaluate the performance of some suboptimal methods and that of the optimal method that minimizes the holistic approximation error (L2-norm). The results show that the OSFS method is optimal and demonstrate the advantages of the L∞-norm over the L2-norm.

Introduction

A time series Ts is a sequence of data points ordered in time, Ts = (t1, t2, …, tm), where t1, t2, …, tm are individual observations and m is the number of observations in the time series.

Time series have been applied in many areas such as medicine [2], economics [3], telecommunications [4] and online signature verification [5]. Due to the large amount of data involved in most application areas, many methods have been proposed to reduce the size of the data without losing relevant information [6].

In the context of this work, segmentation consists of dividing the time series at relevant points, called cut points (CP), to reduce its dimensionality by means of a new representation space [7]. The problem of representation is a core issue in pattern recognition; it can dramatically impact the classification performance, the computational resources required to solve a particular problem, and the interpretability of the solutions found [8].

For this purpose, one of the most widely used techniques is Piecewise Linear Approximation (PLA). PLA divides a time series into segments and uses a linear function to approximate each segment. There are two types of linear approximation [1]: linear interpolation, which uses the straight line connecting the two endpoints of a segment to represent its data points and generates continuous piecewise lines; and linear regression, which uses the regression line to approximate a segment and produces a set of disconnected lines.
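To make the distinction concrete, here is a minimal Python sketch (not the authors' C++ implementation) that approximates a single segment with each variant; the function names and the NumPy-based setup are illustrative assumptions.

```python
# Illustrative sketch of the two PLA variants for one segment [i, j].
import numpy as np

def interp_segment(t, y, i, j):
    # Linear interpolation: the line through the endpoints (t[i], y[i]) and
    # (t[j], y[j]); consecutive segments share endpoints, so the overall
    # approximation is a continuous piecewise line.
    slope = (y[j] - y[i]) / (t[j] - t[i])
    return y[i] + slope * (t[i:j + 1] - t[i])

def regress_segment(t, y, i, j):
    # Linear regression: the least-squares line over the segment's points;
    # adjacent regression lines need not meet, giving disconnected pieces.
    slope, intercept = np.polyfit(t[i:j + 1], y[i:j + 1], deg=1)
    return slope * t[i:j + 1] + intercept

t = np.arange(10, dtype=float)
y = np.sin(t / 3.0)
print(interp_segment(t, y, 0, 4))
print(regress_segment(t, y, 0, 4))
```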

Linear interpolation produces a smooth approximation and has low computational complexity because the number of straight lines is much smaller than the number of points of the time series.

The error of the approximation can be calculated using the L2-norm or the L∞-norm. The first is computed as the sum of the squared vertical differences between the straight lines obtained in the segmentation and the real data points; this value is called the integral squared error (ISE). The second is computed as the maximal vertical difference between all the straight lines obtained in the segmentation and the real data points; this value is called the maximum error (emax). ISE and emax are calculated as

$ISE = \sum_{i=1}^{n} e_i^{2}, \qquad e_{max} = \max_{1 \leq i \leq n} e_i,$

where ei is the vertical distance from the real data point Pi to its corresponding straight line and n is the number of points of the time series.
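As a concrete illustration, the hypothetical Python sketch below computes both measures for a segmentation under linear interpolation; representing the segmentation as a list of cut-point indices is an assumption for this example, not the paper's data structure.

```python
# Illustrative sketch (assumed representation): a segmentation is a list of
# cut-point indices with cuts[0] == 0 and cuts[-1] == len(y) - 1, and each
# segment is approximated by the line through its two endpoints.
import numpy as np

def approximation_errors(t, y, cuts):
    ise, emax = 0.0, 0.0
    for i, j in zip(cuts[:-1], cuts[1:]):
        slope = (y[j] - y[i]) / (t[j] - t[i])
        approx = y[i] + slope * (t[i:j + 1] - t[i])
        e = np.abs(y[i:j + 1] - approx)      # vertical distances e_i
        ise += float(np.sum(e ** 2))         # ISE accumulates over all segments
        emax = max(emax, float(np.max(e)))   # emax is the worst single deviation
        # Shared endpoints are visited twice, but their error is zero under
        # linear interpolation, so the ISE value is unaffected.
    return ise, emax

t = np.arange(8, dtype=float)
y = np.array([0.0, 1.2, 1.9, 3.1, 2.5, 1.4, 0.9, 0.1])
print(approximation_errors(t, y, [0, 3, 7]))
```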

Depending on whether the objective is to minimize the error or the amount of information [9], the problem can be addressed in two ways: obtaining the best segmentation of a time series using K segments, in which case the holistic approximation error (ISE or emax) is minimized; or obtaining the best segmentation such that the maximum error of any segment (emax) or the accumulated error of all segments (ISE) is below some prefixed threshold, in which case the amount of information used to represent the time series is minimized.

Many works have been proposed to obtain the best segmentation using K segments, where the L2-norm is mostly used. However, there are two main drawbacks when the L2-norm is applied [10]: firstly, this constraint cannot generate error-guaranteed representations for streaming data, since the stream is naturally unbounded in size; secondly, the L2-norm is not able to control the approximation error on individual stream data items. To avoid these drawbacks, other methods based on the L∞-norm have been proposed.

Taking into account how the cut points are obtained, segmentation methods can be categorized into three major groups of approaches [9]: Sliding Windows, where the segment increases until a preset error is exceeded and the new segment starts from the last point that does not exceed the preset error; Top-Down, where the time series is divided into smaller and smaller segments until a predetermined error is reached; and Bottom-Up, where, starting from the largest possible number of segments, these are merged until a predetermined error is exceeded.
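As a rough sketch of the first of these families, the following hypothetical Python function grows a window until the interpolation error bound would be exceeded and then cuts at the last admissible point; it assumes strictly increasing time stamps and is not taken from any of the cited methods.

```python
# Minimal sliding-window sketch with linear interpolation and an
# L-infinity error bound; t must be strictly increasing.
import numpy as np

def sliding_window(t, y, max_err):
    n = len(y)
    cuts, start, end = [0], 0, 1
    while end < n:
        # error of covering start..end with the line through its endpoints
        slope = (y[end] - y[start]) / (t[end] - t[start])
        approx = y[start] + slope * (t[start:end + 1] - t[start])
        if np.max(np.abs(y[start:end + 1] - approx)) > max_err:
            cuts.append(end - 1)   # new segment starts at the last valid point
            start = end - 1
        else:
            end += 1
    cuts.append(n - 1)
    return cuts

t = np.arange(12, dtype=float)
y = np.array([0, 1, 2, 3, 2, 1, 0, 1, 2, 3, 4, 5], dtype=float)
print(sliding_window(t, y, 0.5))   # -> [0, 3, 6, 11]
```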

Other techniques based on metaheuristics (genetic algorithms and particle swarm optimization) have also been proposed [11], [12].

Depending on whether the time series is fixed in size or grows dynamically, the methods are classified as offline or online, respectively. Offline methods segment the complete time series, whereas online methods obtain segments based on the data seen so far. Sarker [13] compared static with dynamic segmentation, considering whether the number of segments is prefixed or not.

All the methods mentioned above are suboptimal. Due to their high computational complexity, optimal methods can hardly be used in real-time applications. However, they can be used to evaluate the performance of suboptimal methods. An optimal offline method (OSTS) was proposed in Carmona-Poyato et al. [14]; it is based on the A* algorithm, the L2-norm and linear interpolation, and its computational time was greatly reduced by using pruning algorithms. This method was used to obtain the performance of some suboptimal methods based on the L2-norm.

Xie et al. [10] proposed an optimal online method based on the L∞-norm and linear regression. A linear interpolation-based method was also proposed by Xie et al., but it is not guaranteed to be optimal.

In this work, we propose a new optimal offline method, called OSFS, based on feasible space (FS) [1], that uses the L∞-norm and linear interpolation.

As mentioned before, Xie et al. [10] showed the main drawbacks of the L2-norm versus the L∞-norm. Since our proposal is optimal with respect to the L∞-norm, it will be compared to the optimal method (OSTS) proposed in Carmona-Poyato et al. [14], which is based on the L2-norm.

The present paper is arranged as follows. Section 2 describes the methods that were compared with the proposed method and whose performances were evaluated. Section 3 explains the new proposal. The experiments and results are detailed in Section 4. Finally, the main conclusions are summarized in Section 5.

Section snippets

Related work

In this section, some suboptimal time series segmentation methods will be described and their performances will be compared using the OSFS method and the new performance measure.

The first group of methods is heuristic and the second is metaheuristic.

Proposed method (OSFS method)

In this work, a new optimal offline segmentation method called Optimal Segmentation based on Feasible Space (OSFS) is proposed. This method minimizes the number of segments with an error bound guarantee: given a maximum allowed error based on the L∞-norm, the number of cut points (or segments) used to approximate the time series must be minimized. Since there can be several such solutions, the one that also minimizes the value of ISE (L2-norm) is obtained. The solution to our problem is
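To give a feel for the shortest-path formulation behind this idea, here is a hedged brute-force Python sketch, without the feasible-space pruning that makes the actual OSFS method tractable: nodes are data points, an edge (i, j) exists when the line through points i and j keeps every intermediate point within the bound, and a lexicographic Dijkstra over (number of segments, ISE) returns, among all segmentations with the fewest segments, one with minimal ISE. All names are hypothetical.

```python
# Brute-force sketch of the graph formulation; O(n^3), for illustration only.
import heapq
import numpy as np

def seg_errors(t, y, i, j):
    # emax and squared error of approximating points i..j with the line
    # through the segment's endpoints (linear interpolation)
    slope = (y[j] - y[i]) / (t[j] - t[i])
    e = y[i:j + 1] - (y[i] + slope * (t[i:j + 1] - t[i]))
    return float(np.max(np.abs(e))), float(np.sum(e ** 2))

def min_segments(t, y, max_err):
    n = len(y)
    best = {0: (0, 0.0)}               # node -> best (segments, ISE) found
    heap = [(0, 0.0, 0, [0])]          # (segments, ISE, node, cut points)
    while heap:
        k, ise, i, cuts = heapq.heappop(heap)
        if i == n - 1:
            return cuts                # first arrival is lexicographically optimal
        if best.get(i, (k, ise)) < (k, ise):
            continue                   # stale heap entry
        for j in range(i + 1, n):
            emax, se = seg_errors(t, y, i, j)
            if emax > max_err:
                continue               # no edge; FS pruning avoids most of these tests
            cand = (k + 1, ise + se)
            if cand < best.get(j, (np.inf, np.inf)):
                best[j] = cand
                heapq.heappush(heap, (cand[0], cand[1], j, cuts + [j]))
    return None

t = np.arange(12, dtype=float)
y = np.array([0, 1, 2, 3, 2, 1, 0, 1, 2, 3, 4, 5], dtype=float)
print(min_segments(t, y, 0.5))         # -> [0, 3, 6, 11]
```

Since every edge weight (1, segment squared error) is nonnegative and addition is monotone under lexicographic order, the first arrival at the last point is optimal; in the paper, the feasible space of Liu et al. [1] serves to obtain the possible successors of a cut point directly, instead of testing every pair as this sketch does.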

Experiments and results

This section shows the time series considered to evaluate the different methods, the experimental setting and the results obtained. The OSFS method has been implemented in C++ and all the experiments were run on an Intel(R) Core(TM) i7-8700K CPU at 3.70 GHz with 64 GB of RAM. The code and the time series used are available at https://github.com/ma1capoa/OSFS_Method

Conclusions and future improvements

The conclusions of this work can be summarized as follows. The present work proposes an optimal time-series segmentation with an error bound guarantee (L∞-norm). Taking into account that several optimal solutions are possible, the one that minimizes the RMSE value (L2-norm) is obtained. In order to reduce the computational time by pruning suboptimal solutions, the feasible space (FS) method proposed by Liu [1] is used to obtain the possible successors of a cut point with error bound guarantee. On

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work has been developed with the support of Research Projects TIN2016-75279-P and TIN2017-85887-C2-1-P of the Spanish Ministry of Economy, Industry and Competitiveness, and FEDER funds.


References (26)

  • C. Cortes et al., Hancock: a language for extracting signatures from data streams, in: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.

  • M. Okawa, Time series averaging and local stability weighted dynamic time warping for online signature verification, ...

  • E. Keogh et al., Segmenting time series: a survey and novel approach, Data Mining in Time Series Databases, 2004.
Cited by (5)

  • Adaptive error bounded piecewise linear approximation for time-series representation, Engineering Applications of Artificial Intelligence, 2023.

  • Optimal online time-series segmentation, Knowledge and Information Systems, 2023.

  • Optimal Segmented Linear Regression for Financial Time Series Segmentation, IEEE International Conference on Data Mining Workshops (ICDMW), 2021.

A. Carmona-Poyato received his title of Agronomic Engineering and Ph.D. degree from the University of Cordoba (Spain), in 1986 and 1989, respectively. Since 1990 he has been working with the Department of Computing and Numerical Analysis of the University of Cordoba as a lecturer. His research is focused on image processing, 2-D object recognition and time series analysis.

N.L. Fernández-García received his Bachelor's degree in Mathematics from the Complutense University of Madrid (Spain) in 1988 and his Ph.D. in Computer Science from the Polytechnic University of Madrid (Spain) in 2002. Since 1990, he has been working with the Department of Computing and Numerical Analysis at Córdoba University, where he is currently an assistant professor. His research is focused on edge detection, 2-D object recognition, evaluation of computer vision algorithms and time series analysis.

F.J. Madrid-Cuevas received his Bachelor's degree in Computer Science from Malaga University (Spain) and his Ph.D. degree from the Polytechnic University of Madrid (Spain), in 1995 and 2003, respectively. Since 1996 he has been working with the Department of Computing and Numerical Analysis of Cordoba University, where he is currently an assistant professor. His research is focused mainly on image segmentation, 2-D object recognition, evaluation of computer vision algorithms and time series analysis.

Antonio Manuel Durán Rosal received the B.S. degree in Computer Science in 2014 and the M.Sc. degree in Computer Science in 2016 from the University of Córdoba, and the Ph.D. in Computer Science from the University of Madrid (Spain) in 2018. Since 2018, he has been working at the Department of Quantitative Methods, Loyola University of Andalucía, Spain, where he is currently an assistant professor. His current interests include a wide range of topics concerning machine learning, pattern recognition and time series analysis.
