1 Introduction

Multivariate time series data arise in many domains including the web (Xu et al. 2018b), sensor networks (Zhang et al. 2019a), database systems (Van Aken et al. 2017), finance (Zhu and Shasha 2002) and others. Time series from these domains often exhibit both seasonal behavior and long-term trends (Saad et al. 1998; Luo et al. 2005). For example, city traffic levels (Jindal et al. 2013) and domestic energy consumption (Hong et al. 2016) both have an inherent period related to the regularity of human activity (daily/weekly oscillations) as well as long-term trends. Consider, for example, Google flu searches (Google 2014) over 13 years at one-week granularity, visualized in Fig. 1. Search rates exhibit annual seasonality, while in the long term neighboring countries share a decreasing trend. In this type of time series, a common problem across domains is the detection of anomalous time points in each of the co-evolving series. How can we effectively detect anomalous time points in seasonal time series with long-term trends, without prior knowledge or supervision?

Fig. 1

An example of multivariate time series with seasonality, trends, and different types of anomalies: a The map shows three neighboring countries in South America: Peru, Brazil and Bolivia; b the graph shows weekly time series of Google flu searches for these three countries spanning from 2002 to 2015. Point and segment annotations are predicted anomalies

Accurate anomaly detection enables a host of applications including health monitoring and risk prevention (Vallis et al. 2014), financial fraud loss protection (Goldstein 2014) and data cleaning (Zhang et al. 2017). Anomalies are defined as “points” that deviate significantly from the normal state and are differentiated into two types: point and contextual anomalies. Point anomalies, denoted as circles in Fig. 1, consist of a single outlying observation which stands out. Contextual anomalies (segments in Fig. 1) consist of sequences of outliers, which can be mistaken for normal behavior due to their persisting nature, and hence require long-term contextual information to be detected. The detection of anomalies in multivariate time series elucidates the behavior of a system holistically—both normal and abnormal. For instance, it can help us make sense of patterns and anomalies in the co-evolving national time series of cases of COVID-19.

Anomaly detection is often formulated as a classification problem (Liu et al. 2015; Laptev et al. 2015), and as such, it requires supervised model learning. However, the labels for anomalies are rare and typically expensive to obtain, leading to a surge of interest in unsupervised approaches, including statistical methods (Manevitz and Yousef 2001; Hautamaki et al. 2004; Breunig et al. 2000) and deep-learning (An and Cho 2015; Zhang et al. 2019a). While some existing methods do not rely on explicit anomaly annotations, they have three fundamental limitations: (1) they require normal (non-anomalous) data, (2) they are sensitive to small local variations, and (3) they do not offer model interpretability.

In this work, we propose an unsupervised offline (or batch) method, called AURORA, to detect anomalies in multivariate time series with an explicit estimation of the normal behavior as a mixture of seasonality and trends. Our formulation of normal behavior is flexible and allows capturing diverse temporal patterns. Namely, we formulate seasonality and trend fitting as an optimal reconstruction problem employing the Ramanujan periodic dictionary and a spline dictionary, respectively. This new framework ensures the accurate discovery of interpretable normal behavior while highlighting anomalies.

Our contributions in this paper are as follows:

Novel formulation We introduce the problem of anomaly detection in seasonal multivariate time series with long-term non-linear trends.

Interpretability Our framework AURORA automatically detects anomalies and simultaneously obtains an interpretable model of the seasonal and trend components in the observed time series.

Applicability AURORA ensures strong quantitative improvements on synthetic and real-world datasets in anomaly detection (up to \(30\%\) improvement in AUC) and superior scalability (14 seconds on data with half a million time points and 200 univariate series) when compared to state-of-the-art baselines.

2 Related work

2.1 Anomaly detection

Existing methods can be broadly categorized into supervised and unsupervised. Supervised anomaly detection methods such as one-class SVM (Manevitz and Yousef 2001) and Isolation Forest (Liu et al. 2008) employ labeled anomaly data and pose the problem as binary classification (Aminikhanghahi and Cook 2017). Since labels for anomalies in time series are expensive to obtain and largely unavailable (Zhang et al. 2019a), we focus on the unsupervised case. In unsupervised settings, distance-based methods, such as kNN (Hautamaki et al. 2004), are commonly used to quantify the difference between normal and anomalous samples. Subspace learning methods have also been proposed, where the goal is to identify a subspace in which the difference between normal and anomalous samples is more pronounced (Keller et al. 2012). Both distance-based and subspace-based methods are designed for anomaly detection in datasets of independent samples and do not consider the temporal structure of time series. Thus, they are not well-suited to localize anomalies in time series.

To account for the temporal structure, some methods explicitly model temporal patterns while others utilize comparisons across time. In the former category, normal behavior with multiple periods is rarely considered and many time series models are restricted to the univariate case. The Autoregressive Moving Average framework (Chen and Liu 1993) accounts for temporal correlations and can be extended to include seasonality, but is sensitive to noise and is limited to single-period time series. TwitterR (Hochenbaum et al. 2017) applies the Extreme Studentized Deviate outlier test after decomposing the univariate time series into its median, seasonal and residual components. This method assumes that the periodicity is known and only allows a single period. Other methods map the time series into a feature space (Chan and Mahoney 2005), a process which requires domain knowledge to construct informative features. Methods relying on comparison across time declare a time window as anomalous if its distance to past windows is too large (Wei et al. 2005). However, it is challenging to select an optimal window size for the analysis. Such methods also require some supervision in that they rely on a sanitized (anomaly-free) period of observations. Matrix Profile (Yeh et al. 2016) is another popular distance-based method. It profiles the distance of all fixed-length subsequences to their closest subsequence. Since Matrix Profile operates at the subsequence level and can only declare windows of a pre-selected size as anomalous, it is not a good fit for our task of pinpointing individual anomalous time points or contextual anomalies of variable length (De Paepe et al. 2019).

Given the wide variety of structures in multivariate time series, deep learning techniques have recently gained increasing attention in anomaly detection (Zhang et al. 2019a; Hundman et al. 2018; An and Cho 2015; Xu et al. 2018a; Zhou and Paffenroth 2017). In Zhang et al. (2019a), the authors developed a multi-scale model that captures temporal patterns. The authors of Hundman et al. (2018) compare LSTM model predictions to actual observations and use thresholding to detect anomalies. Several variational autoencoder (VAE) methods have also been proposed for this task (An and Cho 2015; Xu et al. 2018a; Zhou and Paffenroth 2017). The recent meta-analysis by Zhang et al. (2019a) demonstrates that such deep learning methods do not exploit the temporal structure, suffer from sensitivity to noise, and do not offer interpretation. In addition, deep learning anomaly detectors also require normal (anomaly-free) data to train the underlying “normal” model. Therefore, these methods are not applicable to fully unsupervised scenarios where normal data is not available. As we demonstrate experimentally, our method consistently outperforms representative methods from this group in both synthetic and real-world applications, even when they are given access to normal data.

2.2 Change point detection in time series

Change points can be viewed as a specific type of anomaly where the change is long-lasting in nature. Statistical change point detection methods generally assume independent and identically distributed data within a temporal segment to establish probabilistic models, and cannot be applied directly on data with periodicity and trends. They require multiple observations in each segment for accurate estimation (Zhang et al. 2017), and hence may be limited to detecting contextual anomalies. Some methods require knowledge of the underlying data distributions (Killick et al. 2012), while others constrain the target change point types, for instance focusing on level shifts (Bleakley and Vert 2011; Zhang et al. 2019b).

2.3 Period learning

Traditional period detection methods have employed Fourier transform (Li et al. 2010; Indyk et al. 2000) and work in the frequency domain to determine pronounced periods. The major drawback of such methods is the detection of a large number of spurious periods (Tenneti and Vaidyanathan 2015). Auto-correlation is another approach for period learning (Davies and Plumbley 2005) which employs similarity among time segments. Methods in this category rely on a predefined threshold for selecting dominant periods and often employ heuristic post-processing or integration with Fourier spectrograms (Vlachos et al. 2005). Recently, a dictionary-based period learning framework has been proposed by Tenneti and Vaidyanathan (2015), comprised of a family of periodic dictionaries and a unified convex model for period detection. Recent work has offered improvements by exploiting the group structure in the dictionary (Zhang et al. 2020; Zhang and Bogdanov 2020).

Period learning methods were also proposed for binary sequence data (Li et al. 2015; Yuan et al. 2017). These methods assume that the time series have only one period and rely on prior information about the series. For example, the model in Li et al. (2015) requires an appropriate segmentation threshold. More importantly, these methods are only applicable to binary sequences and are not suited to deal with general time series.

3 Preliminaries and notation

3.1 Notation

We denote by \({A}_i\), \({ A}^i\), \({ A}_{ij}\) the i-th column, the i-th row, and the ij-th element of matrix A, respectively. Throughout the paper, \(\left\| \cdot \right\| _F\), \(\left\| \cdot \right\| _1\) and \(\left\| \cdot \right\| _*\) denote the Frobenius, L1 and nuclear norms. The nuclear norm is defined as \(\left\| A \right\| _* = \sum _i \sigma _i\), where \(\sigma _i\) is the i-th singular value of A. I denotes the identity matrix.

3.2 The Ramanujan periodic dictionary

The Ramanujan periodic dictionary was proposed by Tenneti and Vaidyanathan (2015) to learn underlying periods in univariate time series. For a given period g, the Ramanujan periodic dictionary is defined as a nested matrix:

$$\begin{aligned} {\varPhi }_g = \left[ {P}_{d_1}, {P}_{d_2},\ldots {P}_{d_K} \right] , \end{aligned}$$
(1)

where \(\left\{ d_1,d_2,\ldots ,d_K \right\} \) are the divisors of the period g sorted in increasing order and \({P}_{d_i}\in \mathbb {R}^{g\times \phi \left( d_i \right) } \) is a periodic basis matrix for period \(d_i\), with \(\phi \left( d_i \right) \) denoting the Euler totient function evaluated at \(d_i\). Since \(\sum _i\phi \left( d_i \right) = g\), the block \({\varPhi }_g\) has g columns in total. Each basis matrix is constructed from the Ramanujan sum (Tenneti and Vaidyanathan 2015):

$$\begin{aligned} C_{d_i}(n) = \sum _{k=1,\,gcd(k,{d_i})=1}^{d_i} e^{j2\pi kn/{d_i}}, \end{aligned}$$
(2)

where n is an integer index and \(gcd(k,d_i)\) denotes the greatest common divisor of k and \({d_i}\). The Ramanujan periodic basis matrix is constructed as a circulant matrix as follows:

$$\begin{aligned} { P}_{d_i} = \begin{bmatrix} C_{d_i}(0) &{} C_{d_i}(g-1) &{} \ldots &{}C_{d_i}(1) \\ C_{d_i}(1)&{}C_{d_i}(0) &{} \ldots &{}C_{d_i}(2) \\ \ldots &{} \ldots &{} \ldots &{} \ldots \\ C_{d_i}(g-1) &{} C_{d_i}(g-2) &{} \ldots &{}C_{d_i}(0) \end{bmatrix}. \end{aligned}$$
(3)

Combining these blocks, we obtain the overall Ramanujan periodic dictionary R for maximum period \(g_{max}\) as \({R} = \left[ {\varPhi }_1 ,\ldots , {\varPhi }_{g_{max}} \right] \), where R is also called a nested periodic matrix (NPM). To analyze time series of length T, the columns of R are periodically extended to length T. One can then reconstruct a time series as a linear combination of a few columns of R, where the dominant periods of the time series correspond to high-magnitude coefficients.
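A possible NumPy construction of the dictionary is sketched below. The function names and the decision to keep duplicate divisor blocks across the \({\varPhi }_g\)'s are our own illustrative choices, not part of the released implementation.

```python
import numpy as np
from math import gcd

def ramanujan_block(d, T):
    """Periodic basis for period d, extended to T rows: the first phi(d)
    columns of the d x d circulant built from the Ramanujan sum c_d(n)."""
    c = np.array([sum(np.cos(2 * np.pi * k * n / d)
                      for k in range(1, d + 1) if gcd(k, d) == 1)
                  for n in range(d)])
    phi_d = sum(gcd(k, d) == 1 for k in range(1, d + 1))   # Euler totient phi(d)
    block = np.column_stack([np.roll(c, j) for j in range(phi_d)])
    reps = -(-T // d)                                       # ceil(T / d)
    return np.tile(block, (reps, 1))[:T]                    # periodic extension

def nested_periodic_matrix(g_max, T):
    """R = [Phi_1, ..., Phi_{g_max}]; Phi_g stacks the blocks of g's divisors."""
    blocks = []
    for g in range(1, g_max + 1):
        divisors = [d for d in range(1, g + 1) if g % d == 0]
        blocks += [ramanujan_block(d, T) for d in divisors]
    return np.hstack(blocks)   # blocks for repeated divisors recur across Phi_g's
```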

3.3 Spline dictionary

PB-spline (penalized B-spline) regression (Goepp et al. 2018; Eilers and Marx 2010) is a flexible method to fit curves using B-splines of degree d with a smoothness regularization. Quadratic or cubic B-splines are sufficient for most applications (Eilers and Marx 2010). A spline of degree d with k distinct interior knots \(\left\{ u_1, \dots , u_k \right\} \) is a function constructed by connecting and summing polynomial segments of degree d. We construct the spline from B-spline basis functions \(B_{i,d}(u)\), which are defined recursively by the Cox-de Boor formula: \(B_{i,0}(u) = 1\) if \(u_i \le u < u_{i+1}\), and 0 otherwise; for \(p \ge 1\),

$$\begin{aligned} B_{i,p}(u) = \frac{u-u_i}{u_{i+p}-u_i}B_{i,p-1}(u) + \frac{u_{i+p+1}-u}{u_{i+p+1}-u_{i+1}}B_{i+1,p-1}(u). \end{aligned}$$

Each \(B_{i,d}(u)\) is non-zero on \([u_i, u_{i+d+1})\). The resulting spline is a linear combination of the basis functions. A sequence of equally-spaced knots is often specified, with a regularization term to penalize for overfitting and to encourage smoothness of the fitted curve.
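As a concrete illustration, the following sketch evaluates a degree-d B-spline design matrix over equally spaced knots on [0, 1] via the Cox-de Boor recursion. The clamped knot vector and the function name are illustrative assumptions rather than a prescribed construction.

```python
import numpy as np

def bspline_design_matrix(T, k_interior, degree=3):
    """Degree-d B-spline design matrix of shape (T, k_interior + degree + 1)
    over equally spaced knots on [0, 1], via the Cox-de Boor recursion."""
    u = np.linspace(0.0, 1.0, T)                              # evaluation grid
    interior = np.linspace(0.0, 1.0, k_interior + 2)[1:-1]    # interior knots
    knots = np.r_[np.zeros(degree + 1), interior, np.ones(degree + 1)]  # clamped
    # degree-0 bases: indicator functions of the knot spans
    B = np.zeros((T, len(knots) - 1))
    for i in range(len(knots) - 1):
        B[:, i] = (knots[i] <= u) & (u < knots[i + 1])
    B[-1, np.where(knots[:-1] < knots[1:])[0][-1]] = 1.0      # include right endpoint
    # Cox-de Boor recursion up to the requested degree
    for p in range(1, degree + 1):
        B_new = np.zeros((T, len(knots) - p - 1))
        for i in range(len(knots) - p - 1):
            term = np.zeros(T)
            if knots[i + p] > knots[i]:
                term += (u - knots[i]) / (knots[i + p] - knots[i]) * B[:, i]
            if knots[i + p + 1] > knots[i + 1]:
                term += (knots[i + p + 1] - u) / (knots[i + p + 1] - knots[i + 1]) * B[:, i + 1]
            B_new[:, i] = term
        B = B_new
    return B   # columns: m = k_interior + degree + 1 basis functions
```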

4 Problem definition

In this paper we study the anomaly detection problem on multivariate time series. Suppose we are given a multivariate time series matrix \({X}\in \mathbb {R}^{t \times n}\) with t time points and n samples. We model this input as an additive mixture of trend, seasonal, anomaly and noise components:

$$\begin{aligned} \begin{aligned} { X} = { X}_{T}+{X}_{S}+ { O} +{\delta }, \end{aligned} \end{aligned}$$
(4)

where \({ X}_{ S}\) denotes the seasonal component, \( { X}_{ T}\) denotes the trend component, O represents anomalies, and \(\delta \) is random noise. In general, prior information about these three components is not available, therefore, we propose to minimize the following objective to model them:

$$\begin{aligned} \begin{aligned} \underset{ X_S, X_T }{\mathrm {argmin}} \quad \left\| X-X_S-X_T \right\| _1+R_1(X_S)+ R_2(X_T), \end{aligned} \end{aligned}$$
(5)

where anomalies are computed as the residual: \(O=X-X_S-X_T\). In the above objective, \(R_1(X_S)\) and \(R_2(X_T)\) are regularization terms for \(X_S\) and \(X_T\), respectively. Note that minimizing the L1 norm reconstruction cost is robust to anomalies, therefore, the objective can capture the \({X}_{T}\) and \({ X}_{S}\) components without being sensitive to distortion. The two learned components comprise the normal temporal behavior of the data.

4.1 Seasonality modeling \(R_1(X_S)\)

We impose structure on the periodic component by harnessing the Ramanujan periodic dictionary. Namely, we cast the estimation of \(X_S\) as a sparse coding problem:

$$\begin{aligned} \underset{ U }{\mathrm {argmin}} \quad \uplambda _1\left\| U \right\| _1,\quad s.t. \quad X_{S}= GU \end{aligned}$$
(6)

where \({G} = { R}{H}^{-1}\in \mathbb {R}^{t\times q}\), with q the number of columns of R, and \(\uplambda _1\) is a balance parameter. Here, H is a diagonal matrix that penalizes large periods in R during sparse coding, with \({ H}_{ii} = p^2\), where p is the period of the i-th column of R. Finally, the period coefficients can be recovered as \(\hat{U} = H^{-1}U\).
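In code, this penalty amounts to a diagonal rescaling of the dictionary. A minimal sketch, assuming a vector col_periods that stores the period associated with each column of R (the helper name is ours):

```python
import numpy as np

def period_weighted_dictionary(R, col_periods):
    """G = R H^{-1}, with H diagonal and H_ii = p_i^2, where p_i is the period
    of column i of R; larger periods are thus penalized during sparse coding."""
    h_inv = 1.0 / np.asarray(col_periods, dtype=float) ** 2
    return R * h_inv               # scales column i of R by 1 / p_i^2

# after solving for U, the period coefficients are recovered as U_hat = H^{-1} U:
# U_hat = U / (np.asarray(col_periods, dtype=float) ** 2)[:, None]
```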

4.2 Trend modeling \(R_2(X_T)\)

Our second goal is to impose structure on the trend component \({ X}_{T}\). Trends typically do not follow a regular parametric shape and users have no prior information about them; therefore, we employ a spline-based approximation, an effective nonparametric smooth shape estimator built from polynomial functions. We introduce a spline dictionary \(A\in {\mathbb R}^{t \times m}\), where each column represents a spline basis function. We use the degree-d B-spline basis defined over k interior knots, so that \(m = k+d+1\), and employ equally-spaced knots to construct the dictionary. To ensure separability between the periodic and trend components, we first construct a raw spline basis matrix S and then pre-process it to be orthogonal to the periodic dictionary R. Let \({ \tilde{R}}\) be an orthonormal version of R, which can be constructed by the Gram-Schmidt process; \({ \tilde{R}}\) has the same dimensions as R since the latter forms a basis. We then keep the component of S perpendicular to the subspace spanned by the columns of \({ \tilde{R}}\): \({ A} = { S} - { \tilde{R}} { \tilde{R}}^T{ S}\).
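A minimal sketch of this pre-processing step, using a QR factorization in place of explicit Gram-Schmidt (equivalent for obtaining an orthonormal basis of R's column space); the helper name is ours:

```python
import numpy as np

def orthogonalize_against(S, R):
    """A = S - R_tilde R_tilde^T S: the component of the raw spline basis S
    orthogonal to the column space of the periodic dictionary R."""
    R_tilde, _ = np.linalg.qr(R)            # orthonormal basis of col(R)
    return S - R_tilde @ (R_tilde.T @ S)
```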

By treating A as the underlying factors, \( { X}_{T}\) can be linearly reconstructed in terms of these factors as follows:

$$\begin{aligned} \begin{aligned} \underset{W }{\mathrm {argmin}} \quad \uplambda _2 \left\| W \right\| _* + \uplambda _3\left\| DW \right\| _F^2, s.t. \quad X_{T} = AW, \end{aligned} \end{aligned}$$
(7)

where W is the coefficient matrix, D is the matrix form of the difference operator, and \(\uplambda _2\) and \(\uplambda _3\) are balance parameters. Multivariate time series are often collected from the same system and thus tend to share similar global trends. To exploit this, we impose a low-rank constraint on W through a nuclear norm penalty. We also incorporate a smoothness regularization on W, \(\left\| DW \right\| _F^2\), to prevent overfitting and to encourage a smooth fitted curve. We define this difference penalty for 1st- through 3rd-order differences as follows:

$$\begin{aligned} \begin{aligned} {\left\{ \begin{array}{ll} \left[ DW \right] _{ij} = {W}_{ij}-W_{i-1,j} &{} \text { 1st-order } \\ \left[ DW\right] _{ij} = W_{ij}-2W_{i-1,j} +W_{i-2,j}&{} \text { 2nd-order } \\ \left[ DW \right] _{ij} = W_{ij}-3W_{i-1,j} +3W_{i-2,j} - W_{i-3,j} &{} \text { 3rd-order.} \end{array}\right. } \end{aligned} \end{aligned}$$
(8)

We use \(3^{rd}\) order as the default setting, except when noted otherwise. Note that the selection of the type of spline basis functions in dictionary A is flexible, but certain choices such as truncated polynomials are known to be prone to numerical instability.
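The difference operator itself is straightforward to materialize; one possible helper (ours, not from the released implementation) applies NumPy's first-order difference to an identity matrix the required number of times:

```python
import numpy as np

def difference_matrix(m, order=3):
    """Order-k difference operator D so that (D W)_ij is the k-th finite
    difference along the rows of W, as in Eq. (8)."""
    D = np.eye(m)
    for _ in range(order):
        D = np.diff(D, axis=0)    # each pass applies one first-order difference
    return D                       # shape: (m - order, m)
```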

4.3 Overall AURORA objective

By integrating all the above, we arrive at the objective function for anomaly detection on multivariate time series as follows:

$$\begin{aligned} ~ \begin{aligned} \underset{ U,W }{\mathrm {argmin}} \quad \left\| {X}- GU -AW\right\| _1 +\uplambda _1\left\| U \right\| _1 +\uplambda _2 \left\| W \right\| _* + \uplambda _3\left\| DW \right\| _F^2. \end{aligned} \end{aligned}$$
(9)

Instead of modeling outliers explicitly, we detect anomalies from the residual \({O} = {X}- GU -AW\). Intuitively, anomalies diverge from the normal components, i.e., the seasonal and trend components, and thus lead to large residuals. We produce a ranked list of candidate anomaly locations corresponding to high-magnitude residuals. Without any assumptions on anomaly lengths, this model allows us to detect any deviation from the normal state, instead of being restricted to certain types of anomalies.
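Once the coefficients are estimated, anomaly scoring reduces to ranking residual magnitudes. A minimal sketch (the helper name is ours):

```python
import numpy as np

def rank_anomalies(X, G, U, A, W):
    """Return (time, series) index pairs sorted by residual magnitude
    |O_ij|, most anomalous first, together with the residual matrix O."""
    O = X - G @ U - A @ W
    order = np.argsort(-np.abs(O), axis=None)
    positions = np.column_stack(np.unravel_index(order, O.shape))
    return positions, O
```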

5 Optimization

Since the objective function in Eq. 9 is not jointly convex, we optimize the two variables alternately using the Alternating Direction Method of Multipliers (ADMM) framework (Boyd et al. 2011). We first introduce auxiliary variables: \({ V} = { U}\), \( P= { W}\) and \(Y= X- GU -AW\). Then, we rewrite Eq. 9 as follows:

$$\begin{aligned} ~ \begin{aligned}&\underset{ U,W,V,P,Y }{\mathrm {argmin}} \quad \left\| Y\right\| _1 +\uplambda _1\left\| V \right\| _1 +\uplambda _2 \left\| P\right\| _* + \uplambda _3\left\| DW \right\| _F^2\\&s.t \quad V =U, P =W, Y= X- GU -AW. \end{aligned} \end{aligned}$$
(10)

The corresponding Lagrangian function is:

$$\begin{aligned} \begin{aligned}&\mathcal {L}\left( {U,W,V,P,Y,} \Gamma _1,\Gamma _2,\Gamma _3 \right) = \left\| Y\right\| _1 +\uplambda _1\left\| V \right\| _1 +\uplambda _2 \left\| P\right\| _* + \uplambda _3\left\| DW \right\| _F^2\\&+\left\langle \Gamma _1, V -U \right\rangle +\frac{\rho _1}{2} \left\| V-U \right\| _F^2 +\left\langle \Gamma _2, P -W \right\rangle +\frac{\rho _2}{2} \left\| P-W \right\| _F^2 \\&+\left\langle \Gamma _3, Y- \left( X- GU -AW \right) \right\rangle +\frac{\rho _3}{2} \left\| Y- \left( X- GU -AW \right) \right\| _F^2 \end{aligned} \end{aligned}$$
(11)

where \(\Gamma _1 \sim \Gamma _3 \) are the Lagrangian multipliers and \(\rho _1 \sim \rho _3\) are penalty parameters.

5.1 Update Y

The subproblem w.r.t. Y is as follows:

$$\begin{aligned} \begin{aligned} \underset{ Y }{\mathrm {argmin}} \quad \left\| Y\right\| _1+\frac{\rho _3}{2} \left\| Y- \left( X- GU -AW \right) + \frac{\Gamma _3}{\rho _3} \right\| _F^2\\ \end{aligned} \end{aligned}$$
(12)

This problem can be solved based on the following Lemma from Lin et al. (2013).

Lemma 1

For \(\alpha >0\), the following objective has a closed-form solution

$$\begin{aligned} \underset{ A }{\mathrm {argmin}} \quad \frac{1}{2}\left\| A-B \right\| _F^2 +\alpha \left\| A \right\| _1 \end{aligned}$$
(13)

which is written as \(A_{ij} = sign\left( B_{ij}\right) \times max\left( \left| B_{ij} \right| -\alpha ,0 \right) \). Here, \( sign\left( t \right) \) is the signum function defined as:

$$\begin{aligned} sign\left( t \right) ={\left\{ \begin{array}{ll} 1 &{} \text { if } t>0 \\ -1&{} \text { if } t<0 \\ 0&{} \text { if } t=0 \end{array}\right. } \end{aligned}$$
(14)

Based on this lemma, we obtain a closed-form solution for Y:

$$\begin{aligned} \begin{aligned} Y_{ij} = sign\left( E_{ij}\right) \times max\left( \left| E_{ij} \right| -\frac{1}{\rho _3},0 \right) \end{aligned} \end{aligned}$$
(15)

where \( E= \left( X- GU -AW \right) - \frac{\Gamma _3}{\rho _3}\).
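In code, this update (and the V update in the next subsection) is a single elementwise soft-thresholding operation; a minimal sketch:

```python
import numpy as np

def soft_threshold(B, alpha):
    """Closed-form minimizer of 0.5 * ||A - B||_F^2 + alpha * ||A||_1 (Lemma 1)."""
    return np.sign(B) * np.maximum(np.abs(B) - alpha, 0.0)

# Y update (Eq. 15): Y = soft_threshold(X - G @ U - A @ W - Gamma3 / rho3, 1.0 / rho3)
# V update (Eq. 17): V = soft_threshold(U - Gamma1 / rho1, lam1 / rho1)
```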

5.2 Update V

The subproblem w.r.t. V is as follows:

$$\begin{aligned} \begin{aligned} \underset{ V }{\mathrm {argmin}} \quad \uplambda _1\left\| V \right\| _1 + \frac{\rho _1}{2} \left\| V-U + \frac{\Gamma _1 }{\rho _1} \right\| _F^2 \end{aligned} \end{aligned}$$
(16)

We similarly obtain a closed-form solution for V employing Lemma 1:

$$\begin{aligned} \begin{aligned} V_{ij} = sign\left( H_{ij}\right) \times max\left( \left| H_{ij} \right| -\frac{\uplambda _1}{\rho _1},0 \right) , \end{aligned} \end{aligned}$$
(17)

where \(H = U - \frac{\Gamma _1 }{\rho _1}\).

5.3 Update P

The subproblem w.r.t. P is as follows:

$$\begin{aligned} \begin{aligned} \underset{ P }{\mathrm {argmin}} \quad \uplambda _2 \left\| P\right\| _*+ \frac{\rho _2}{2} \left\| P-W + \frac{\Gamma _2 }{\rho _2} \right\| _F^2 \end{aligned} \end{aligned}$$
(18)

According to the singular value thresholding (SVT) method (Cai et al. 2010), we can compute a closed-form solution for P as well. By setting \(M = W -\frac{\Gamma _2 }{\rho _2}\), we first take the singular value decomposition of M as \(M =J\Sigma K^T \), where J, K and \(\Sigma \) denote the left-singular vectors, right-singular vectors, and singular values, respectively. Then, we obtain the solution for P as \(P =J\mathcal {S}\left( \Sigma \right) K^T \), where \(\mathcal {S}\left( \Sigma \right) = diag\left[ max\left( \sigma _i-\frac{\uplambda _2}{\rho _2},0 \right) \right] \) and \(\sigma _i\) is the i-th singular value.
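A minimal sketch of this update using a thin SVD (the helper name is ours):

```python
import numpy as np

def singular_value_threshold(M, tau):
    """Minimizer of tau * ||P||_* + 0.5 * ||P - M||_F^2 (Cai et al. 2010)."""
    J, sigma, Kt = np.linalg.svd(M, full_matrices=False)
    return (J * np.maximum(sigma - tau, 0.0)) @ Kt

# P update (Sect. 5.3): P = singular_value_threshold(W - Gamma2 / rho2, lam2 / rho2)
```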

5.4 Update U

The subproblem w.r.t. U is:

$$\begin{aligned} \begin{aligned} \underset{ U }{\mathrm {argmin}}\quad \frac{\rho _1}{2} \left\| V-U + \frac{\Gamma _1 }{\rho _1} \right\| _F^2 +\frac{\rho _3}{2} \left\| Y- \left( X- GU -AW \right) + \frac{\Gamma _3}{\rho _3} \right\| _F^2 \end{aligned} \end{aligned}$$
(19)

By taking the gradient w.r.t. U and equating it to zero, we obtain:

$$\begin{aligned} \begin{aligned} \rho _1U-\left( \rho _1 V+\Gamma _1 \right) +\rho _3G^T\left( GU-X+AW+Y + \frac{\Gamma _3}{\rho _3}\right) =0\\ \end{aligned} \end{aligned}$$
(20)

We get the closed-form solution of U as follows:

$$\begin{aligned} \begin{aligned} U = \left( \rho _1 I+\rho _3G^TG \right) ^{-1}\left[ \rho _1 V+\Gamma _1+ \rho _3G^T\left( X-AW-Y - \frac{\Gamma _3}{\rho _3}\right) \right] \end{aligned} \end{aligned}$$
(21)
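In practice, the explicit inverse in Eq. (21) need not be formed; the update can be computed by solving a \(q\times q\) linear system. A minimal sketch with our own argument names:

```python
import numpy as np

def update_U(X, G, A, W, Y, V, Gamma1, Gamma3, rho1, rho3):
    """Closed-form U update of Eq. (21) via a linear solve."""
    lhs = rho1 * np.eye(G.shape[1]) + rho3 * G.T @ G
    rhs = rho1 * V + Gamma1 + rho3 * G.T @ (X - A @ W - Y - Gamma3 / rho3)
    return np.linalg.solve(lhs, rhs)
```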

5.5 Update W

The subproblem w.r.t. W is as follows:

$$\begin{aligned} \begin{aligned}&\underset{ W }{\mathrm {argmin}} \quad \uplambda _3\left\| DW \right\| _F^2+\frac{\rho _2}{2} \left\| P-W + \frac{\Gamma _2 }{\rho _2} \right\| _F^2+\frac{\rho _3}{2} \left\| Y- \left( X- GU -AW \right) + \frac{\Gamma _3}{\rho _3} \right\| _F^2 \end{aligned} \end{aligned}$$
(22)

By setting the above objective’s derivative w.r.t. W to zero, we obtain:

$$\begin{aligned} \uplambda _3 D^TDW +\rho _2\left( W-P- \frac{\Gamma _2}{\rho _2}\right) +\rho _3 A^T\left( AW-X+GU+Y+ \frac{\Gamma _3}{\rho _3} \right) =0\nonumber \\ \end{aligned}$$
(23)

As a result, a closed-form solution for W is as follows:

$$\begin{aligned} W = \left( \uplambda _3 D^TD+\rho _2I+\rho _3 A^TA \right) ^{-1} \left[ \rho _2P + \Gamma _2+ \rho _3 A^T\left( X-GU-Y- \frac{\Gamma _3}{\rho _3} \right) \right] \nonumber \\ \end{aligned}$$
(24)
Algorithm 1: AURORA

Updates for the Lagrangian multipliers \(\Gamma _1\), \(\Gamma _2\) and \(\Gamma _3\): In the \((i+1)\)-th iteration, the Lagrangian multipliers are updated as follows: \(\Gamma _1^{i+1} =\Gamma _1^{i} +\rho _1\left( V-U \right) \), \(\Gamma _2^{i+1}=\Gamma _2^{i} + \rho _2\left( P-W \right) \), and \(\Gamma _3^{i+1} =\Gamma _3^{i} + \rho _3\left[ Y- \left( X- GU -AW \right) \right] \).
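Putting the five updates and the multiplier steps together, one possible simplified ADMM loop is sketched below. It assumes the soft_threshold and singular_value_threshold helpers sketched above, uses fixed penalty parameters, and omits convergence checks; it is a sketch, not the released MATLAB implementation.

```python
import numpy as np

def aurora_admm(X, G, A, D, lam1=0.1, lam2=0.1, lam3=0.1,
                rho1=1.0, rho2=1.0, rho3=1.0, n_iter=50):
    """Simplified ADMM for Eq. (10); returns the coefficients and residuals."""
    t, n = X.shape
    q, m = G.shape[1], A.shape[1]
    U, W = np.zeros((q, n)), np.zeros((m, n))
    V, P, Y = U.copy(), W.copy(), np.zeros((t, n))
    G1, G2, G3 = np.zeros_like(U), np.zeros_like(W), np.zeros_like(Y)
    lhs_U = rho1 * np.eye(q) + rho3 * G.T @ G            # fixed system matrices
    lhs_W = lam3 * D.T @ D + rho2 * np.eye(m) + rho3 * A.T @ A
    for _ in range(n_iter):
        Y = soft_threshold(X - G @ U - A @ W - G3 / rho3, 1.0 / rho3)     # Eq. 15
        V = soft_threshold(U - G1 / rho1, lam1 / rho1)                    # Eq. 17
        P = singular_value_threshold(W - G2 / rho2, lam2 / rho2)          # Sect. 5.3
        U = np.linalg.solve(lhs_U, rho1 * V + G1 +
                            rho3 * G.T @ (X - A @ W - Y - G3 / rho3))     # Eq. 21
        W = np.linalg.solve(lhs_W, rho2 * P + G2 +
                            rho3 * A.T @ (X - G @ U - Y - G3 / rho3))     # Eq. 24
        G1 += rho1 * (V - U)                              # multiplier updates
        G2 += rho2 * (P - W)
        G3 += rho3 * (Y - (X - G @ U - A @ W))
    return U, W, X - G @ U - A @ W    # residuals O serve as anomaly scores
```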

5.6 Overall algorithm and complexity analysis

We summarize AURORA in Algorithm 1. We repeatedly perform the updates for U and W from Step 3 to Step 12 until convergence. The most substantial running time cost is due to Steps 5, 7 and 8, while the remaining steps are of linear or near-linear complexity, e.g. involving sparse matrix multiplication. For Step 5, the SVD operation has a complexity of \(O\left( min\left( tn^2,t^2n \right) \right) \). Here, t is often much greater than n, therefore, the complexity of the SVD is \(O\left( tn^2 \right) \). Steps 7 and 8 involve the inversion of a square matrix, which incurs a cost of \(O(q^3)\) and \(O(m^3)\) in the worst case, respectively. Because of the sparsity in D and I, the complexity of the two can be significantly reduced to \(O(qI_{nnz})\) and \(O(mS_{nnz})\) based on the analysis in Zhang and Bogdanov (2019), where \(S_{nnz}\) is the number of non-zero elements of \(\uplambda _3 D^TD+\rho _2I \) and \(I_{nnz}\) is the number of non-zero elements of I. Therefore, the overall asymptotic complexity per iteration is \(O\left( tn^2 + qI_{nnz}+mS_{nnz}\right) \). In practice, AURORA often needs only a few iterations to converge. An implementation of our method is available for download at http://www.cs.albany.edu/~petko/lab/code.html.

6 Experimental evaluation

Our goal is to evaluate the performance of our method on data with periodic and trend components with both single point and contextual anomalies. Thus, our selection of datasets, baselines and evaluation metrics are geared towards this setting.

6.1 Datasets

We conduct evaluation experiments on both synthetic and real-world datasets.

6.1.1 Synthetic data

We generate 20-dimensional time series of length 5000 and refer to each univariate time series as a sample. The time series are comprised of four additive components: (1) periodic, (2) trend, (3) anomalies and (4) noise components. We generate the periodic component by following the methodology in Tenneti and Vaidyanathan (2015), namely, we generate a uniformly random sequence of length equal to a pre-specified underlying period and repeat it to obtain a series of the desired length. We select two periods for each time series randomly from \(\{3,5,7,11,13\}\). We use 4-th degree polynomial functions to generate the trends in the time series with two sets of coefficients: \(\left\{ [1, 1,0,-0.1], [1, 0.1,0.1, 0.1] \right\} \). Individual samples are assigned one of the two trend polynomials randomly.

We select random individual time points as positions for point anomalies and add random values to the normal behavior of varying magnitude as outlined in individual experiments. To inject contextual anomalies, we select a position uniformly at random in the time series and add contiguous point anomalies (as described above), of length chosen uniformly from \(\{3, 4, 5, 6\}\). All contextual anomalies are independent in each univariate (sample) time series. We also add Gaussian noise to all time series and control the signal-to-noise ratio (SNR) in the experiments.
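For illustration, a rough sketch of how one such univariate sample could be generated; the parameter values, the SNR handling, and the function name are our own illustrative choices, not the exact settings used in the experiments:

```python
import numpy as np

def synth_sample(T=5000, periods=(5, 13), trend_coef=(1.0, 1.0, 0.0, -0.1),
                 n_point=20, n_context=3, snr=20.0, seed=0):
    """One univariate sample: repeated random periodic patterns + polynomial
    trend + injected point/contextual anomalies + Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, T)
    periodic = sum(np.tile(rng.uniform(-1, 1, p), T // p + 1)[:T] for p in periods)
    x = periodic + np.polyval(trend_coef, t)
    labels = np.zeros(T, dtype=bool)
    idx = rng.choice(T, n_point, replace=False)            # point anomalies
    x[idx] += rng.uniform(3, 5, n_point) * rng.choice([-1, 1], n_point)
    labels[idx] = True
    for _ in range(n_context):                             # contextual anomalies
        start, length = rng.integers(0, T - 6), rng.integers(3, 7)
        x[start:start + length] += rng.uniform(3, 5)
        labels[start:start + length] = True
    noise = rng.normal(0.0, np.std(x) / np.sqrt(snr), T)   # approximate SNR control
    return x + noise, labels
```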

6.1.2 Real-world data

We also experiment with time series from 4 real-world datasets: Power plant (Rosca et al. 2015), Google flu (Google 2014), Yahoo (Laptev and Amizadeh 2015) and NAB (Lavin and Ahmad 2015). We inject point anomalies into the Power plant and Google flu time series following the same protocol as in our synthetic datasets. Note that this is a common evaluation strategy when anomaly labels are not available (Rosca et al. 2015; Emmott et al. 2015). The remaining datasets have anomaly labels.

\(\bullet \) Power plant (Rosca et al. 2015): This dataset is from the 2015 PHM Society Data Challenge. There are a total of 24 sensors with a sampling rate of 15 minutes. We experiment with the first segment of 10,000 time steps. We randomly inject 6 point anomalies for each sensor (144 in total) by following our synthetic anomaly injection methodology. Anomaly positions are independent in each sample.

\(\bullet \) Google flu (Google 2014): The Google flu dataset consists of weekly estimates for influenza rates based on web searches in 29 countries from 2002 to 2015 (659 time points). We inject 6 point anomalies in the time series of each country.

\(\bullet \) Yahoo (Laptev and Amizadeh 2015): The Yahoo A1 benchmark has 67 time series with labeled anomalies. This is a collection of real traffic metrics from Yahoo! services reported hourly. The lengths of individual time series vary between 741 and 1461. Note that existing literature argues that the ground truth in some time series is not reliable (De Paepe et al. 2020). We have annotated such time series in the experiments, but report results on all of them for completeness.

\(\bullet \) NAB (Lavin and Ahmad 2015): The NAB dataset includes time series data from multiple domains with manually labelled anomalies. We report results on the 10 Twitter traffic time series from the benchmark as they fit our assumptions of periodic behavior.

6.2 Experimental setup

6.2.1 Baselines

\(\bullet \) Anomaly detection We compare AURORA on anomaly detection with two baselines: TwitterR (Hochenbaum et al. 2017) and Donut (Xu et al. 2018b). These state-of-the-art anomaly detectors are both flexible in detecting time series patterns of varying length. While Matrix profile (Yeh et al. 2016) is another potential baseline, it operates at a fixed-length subsequence level and is not applicable to detecting anomalies of variable length. Hence, we do not employ it for comparison.

Donut (Xu et al. 2018b) is based on the Variational Autoencoder (VAE) framework. It detects anomalies by scoring dependencies in a time window of fixed length based on a pre-trained model of normal behavior. This method accounts for periodicity and trends in the time series and is thus an especially good fit for our setting. The window size is the main parameter of the method. We report the best result based on a grid search over window sizes varying from 10 to 100 with a step of 10. In all experiments a window size of 10 resulted in the best performance.

TwitterR (Hochenbaum et al. 2017) employs the Seasonal Extreme Studentized Deviate (S-ESD), a popular and robust anomaly detection method for univariate seasonal data. Raw data is decomposed into a median component, seasonal component and residuals. The Extreme Studentized Deviate (ESD) test is then applied to the residuals to produce a list of anomaly time points ordered by their probability of being an outlier. Different from our method which learns the underlying periods from data, this algorithm requires a single dominant period as an input. We set this parameter to the minimum true period present in our synthetic time series which puts the method in an advantageous position. We are, however, interested in characterising its quality for anomaly detection in the best possible settings, and thus, control for other factors of inaccuracy. For experiments on real-world data, the true periods are unknown, so we set TwitterR’s period parameter according to the highest-magnitude DFT frequency in each time series.

\(\bullet \) Period detection AURORA learns the underlying periods from data, and hence, we also evaluate its ability to detect ground truth (GT) periods and compare it to three baselines:

NPM (Tenneti and Vaidyanathan 2015) is the state-of-the-art period learning method based on periodic dictionaries. It encodes time series employing the Ramanujan periodic dictionary and predicted periods are recovered based on the reconstruction coefficients. Since NPM operates on univariate time series, we apply it on each univariate time series and compile top ranked periods as final predictions.

FFT (Priestley 2004) is a classical period learning method that transforms time series into their frequency domain. Predicted periods have high-magnitude coefficients.

AUTO (Li et al. 2015) combines auto-correlation and Fourier transform. It first calculates the auto-correlation of the input data. Next, it employs Fourier transform on the results from the first step to derive periods of highest magnitude.

6.2.2 Performance metrics

We employ the area under the ROC curve (AUC) to quantify anomaly detection performance. A true positive (TP) in the case of point anomalies is the correct identification of a time point annotated as a ground truth anomaly. We treat contextual (interval) anomalies as a collection of point anomalies (e.g., a contextual anomaly of length l is treated as l positive instances), and evaluate AUC for this case in the same manner. Note that this measure is naturally biased toward longer contextual anomalies, which contribute positive examples proportional to their length. Since in some experiments we consider both point and contextual anomalies in the same time series, we focus on this simple measure that can capture both types. It is also worth noting that more optimistic metrics have been employed, where any overlap of a predicted interval with an anomalous interval is declared a TP (De Paepe et al. 2019); however, we employ the above time-point-wise metric as it leaves no ambiguity about the correspondence of predicted and GT time points.
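Computed this way, the metric is a standard point-wise ROC AUC over all (time point, series) instances; for instance, with scikit-learn (the helper name is ours):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pointwise_auc(labels, residuals):
    """labels, residuals: (t, n) arrays. Every time point of every series is one
    instance; contextual anomalies contribute one positive per covered point."""
    return roc_auc_score(labels.ravel().astype(int), np.abs(residuals).ravel())
```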

To quantify the evaluation of period learning we compare the top-k obtained periods with the ground truth (GT) periods, where k is the number of GT periods. We report the accuracy of period identification for datasets with known GT periods.

6.3 Anomaly detection in synthetic time series

We compare the performance of AURORA and the baselines for anomaly detection in three settings: point-only, contextual-only, and mixed anomalies (Fig. 2). For this analysis we vary the signal-to-noise ratio (SNR) and report the AUC averaged over ten data samples for each setting. With decreasing noise level (i.e., increasing SNR), the average AUC of AURORA increases slightly in all three settings and consistently dominates that of the alternatives. In the case of point anomalies, AURORA achieves an AUC of 0.98 at SNR greater than 20, while Donut and TwitterR peak at AUC of 0.83 and 0.68, respectively. AURORA is similarly better than the baselines in the cases of contextual and mixed anomalies, exhibiting a \(15\%\) improvement over Donut and a \(25\%\) improvement over TwitterR. AURORA's advantage is due to its explicit modelling of normal periodic and trending behavior in the time series, which is built into our synthetic datasets. This allows AURORA to precisely detect time points that deviate from normal.

Fig. 2

Comparison of anomaly detection quality for different types of anomalies in synthetic data with varying signal-to-noise ratio (SNR) taking values of [10, 40, 70, 100]. We consider a point anomalies; b contextual anomalies (size range: [3, 4, 5, 6]); c a mixture of both types

TwitterR’s performance is close to constant across SNRs as it employs LOESS local smoothing to extract a seasonal component. We parametrize TwitterR with the smallest ground truth period in our synthetic data, and hence its quality is not affected by incorrect periodicity estimation. However, when this important information is not available (e.g. in real data), its reliance on accurate periodicity estimation becomes a crucial step for TwitterR and a potential weak point, limiting its application. Another key drawback of TwitterR is its assumption of a single period in the data. Our synthetic data features complex/multiple periods, which is another factor behind AURORA’s edge over TwitterR in this set of experiments.

The performance of Donut is also close to constant over varying SNRs. This method uses a pre-trained model to score anomalies, i.e., its anomaly score function is bounded by the quality of its training data. Specifically, the VAE at the core of Donut should ideally be trained on anomaly-free data, thus obtaining a model of normal behavior. The presence of anomalies in the input, however, is incorporated into the model, and thus affects the likelihood scoring of anomalies in testing data. In contrast, AURORA has no requirement for anomaly-free data. Instead, it models anomalies in the testing data directly, without pre-training.

6.4 Anomaly detection in real-world data

We next present anomaly detection results on real-world datasets in Fig. 3. We inject anomalies into the Power plant and Google flu time series. Since the difference in performance between point and contextual anomalies is minimal (as observed in synthetic data in Fig. 2), we focus on point anomalies and inject those in our two datasets without GT. The magnitude of anomalies plays an important role, as anomalies of increasing magnitude are naturally expected to be more discernible. Hence, we evaluate performance while varying the magnitude of the anomalies.

Fig. 3

Comparison of AUC for anomaly detection by varying the magnitude of injected anomalies in the a Google flu; b Power plant datasets

Table 1 AUC comparison for anomaly detection in the Yahoo benchmark. The performance of the best method in each benchmark dataset is marked in bold font

The AUC of AURORA and TwitterR grows with the anomaly magnitude. The AUC of AURORA is greater than 0.9 in most cases and is about 0.3 higher than that of TwitterR on Power plant and 0.25 higher on Google flu. The single-period assumption of TwitterR is not necessarily realistic for these datasets, and the more flexible periodicity model in AURORA may partially explain its performance advantage. Donut exhibits worse performance on both datasets with injected anomalies, and notably its AUC does not grow with the magnitude of the injected anomalies. The key reason for this behavior is that Donut requires anomaly-free data for training, which is often not possible in real-world applications due to the presence of noisy or unidentified anomalies. Since we train it on the actual data with injected anomalies to allow for a fair comparison, its performance is proportionately affected by the magnitude of the anomalies in the training data, which get encoded in the VAE. We use part of the actual data for training the VAE as we do not have access to additional anomaly-free data from the same sources.

The Yahoo and NAB datasets include GT anomaly labels, hence, we present the AUC values for each time series in those benchmarks in Tables 1 and 2. Some time series in the Yahoo benchmark have anomaly labels of questionable quality as reported by others (De Paepe et al. 2019). We mark these time series with unreliable GT labels by an asterisk in Table 1, but show results on all time series for completeness.

AURORA’s anomaly detection quality dominates that of the baselines on most time series in both benchmarks. In particular, AURORA obtains the best results in 46 out of the 67 time series in the Yahoo benchmark, and in 29 out of the 34 time series with reliable labels. In NAB, AURORA obtains the best results in 9 out of the 10 time series. Inspecting the cases in which AURORA is not the best method reveals that it under-performs on data that does not match our modeling assumptions of periodicity and trends well.

Table 2 AUC comparison for anomaly detection on the NAB benchmark, realTwitter series. The performance of the best method in each benchmark dataset is marked in bold font

6.5 The importance of multivariate analysis

Recall that we address anomaly detection in multivariate time series. Our approach takes full advantage of shared periods and/or trends among the individual univariate time series. Univariate anomaly detection, which treats each time series as independent, cannot exploit such shared patterns. To demonstrate the utility of multivariate analysis, we compare AURORA to a univariate version which fits periodicity and trends independently for each time series. To this end, we remove the low-rank regularization term \( \uplambda _2 \left\| W \right\| _* \) from Eq. 9 and call the resulting method AURORAuni. In Fig. 4, we present a comparison between the two alternatives on synthetic and real-world data. Both results show that AURORA outperforms AURORAuni with an AUC increase of at least 0.1.

6.6 A case study: the Google Flu dataset

Next we employ AURORA on the Google Flu dataset (without injecting anomalies) in order to qualitatively explore the reported anomalies in the raw data. Overall, we find that anomalous time points fall well within typical flu seasons.

As an example, we dive deeper into the anomalies detected for Brazil, Bolivia and the United States (Fig. 5). In Brazil, the flu season generally spans May to July, which are the southern hemisphere’s winter months (Alonso et al. 2007). Out of the top 50 anomalies detected for Brazil, 44 fall within this time period. The other 6 points all fall in August, with 5 in August 2009, which corresponds to the outbreak of H1N1 (swine) flu. Brazil registered 7569 new cases of H1N1 flu from August 25 to 29, and later confirmed that the country had the highest number of fatalities from the virus in early September (CNN 2009).

Fig. 4

Comparison between AURORA and a univariate alternative AURORAuni on a synthetic and b the Google Flu datasets

Fig. 5

Case study on Google flu. Blue diamonds and red circles annotate anomalies detected within and outside of flu season, respectively

In Bolivia, the flu season typically spans April to September (US Embassy in Bolivia 2019). Out of the top 50 anomalies detected for Bolivia, 47 fall within this time period, and the remaining 3 are in October 2005 and 2007 which might signify a slightly longer flu season in those years.

In the United States, the flu season typically spans fall and winter, and flu activity peaks between December and February (Centers for Disease Control and Prevention 2018). 43 out of our 50 top anomalies detected for the US fall within this time period. Another 6 are in September through November 2009, which coincides with the outbreak of the H1N1 influenza virus that peaked in October (Centers for Disease Control and Prevention 2009). The remaining anomaly is in early March 2008, which indicates a slightly longer flu season, as supported by the fact that the peak flu activity that year was in mid-February.

6.7 Period learning on synthetic data

In Fig. 6, we present AURORA’s performance on period learning with varying SNR. AURORA shows superiority over the alternatives by consistently recovering all GT periods. The robustness of AURORA can mostly be attributed to the L1-norm fitting function, which is robust to anomalies and noise. In addition, AURORA detects periods via multivariate analysis and explicitly models trends, unlike the alternatives. NPM achieves \(50\%\) accuracy in period detection across settings as it does not model noise or anomalies, which may dominate its objective and obscure the true periods. FFT and AUTO have similar performance as they both employ the Fourier transform. The two methods exhibit different behavior under different types of anomalies. Their accuracy is about \(70\%\) for point anomalies and much lower for contextual and mixed anomalies (about \(20\%\)). This is because point anomalies distort the general periodicity of the time series to a lesser extent than the lengthier contextual anomalies.

Fig. 6

The comparison of period learning by varying SNR under different types of anomalies

6.8 Parameter sensitivity analysis

We study the sensitivity of AURORA to parameters \(\left\{ \uplambda _1, \uplambda _2, \uplambda _3\right\} \) in the synthetic data. We fix one parameter and vary the other two, and report the AUC for anomaly detection in Fig. 7. AURORA achieves a stable close-to-optimal performance for a wide range of parameters. Its minimum AUC in the studied range is around 0.85 which is still better than competitors. The performance of AURORA degrades slightly when the values of parameters approach 1 as large weights on regularization terms undermine the importance of the data fit cost.

6.9 Scalability

We also measure the CPU running time of the three competing anomaly detection methods on synthetic data, varying the length of the time series and the number of samples (Fig. 8). We run AURORA on a Dell desktop with an Intel Core i7 CPU at 3.20 GHz and 31.2 GB of memory, using MATLAB R2018b 64-bit edition without parallel operation. Donut is executed on the same machine with Python 3.7 without parallel operation. TwitterR is implemented and executed on an Intel Core i5 CPU at 2.7 GHz with 8 GB RAM without parallel operation. On a dataset with 0.5 million time points and 200 samples, AURORA completes in 14 seconds. In contrast, on the same data size TwitterR requires 150 hours (over 6 days) to run, and Donut requires 1500 hours (over 2 months), including training and testing. Note that since TwitterR and Donut analyze each univariate time series independently, to obtain these results for the largest data size we ran a single univariate time series and multiplied the running time by the number of samples: 200. As a result, AURORA is more than 38,000 times faster than TwitterR and more than 380,000 times faster than Donut. Even accounting for differences between programming languages and CPUs, AURORA is still orders of magnitude faster than the baselines.

Fig. 7

Parameter sensitivity of AURORA

Fig. 8

Comparison of CPU running time as a function of the time series length and the number of samples (univariate time series) for a AURORA, b TwitterR and c Donut. Note that AURORA's time is reported in seconds while those of the baselines are in hours

7 Conclusions

In this paper, we proposed a novel solution for the problem of anomaly detection in multivariate time series in the presence of seasonality and trends. We introduced an efficient and accurate method, called AURORA, which offers interpretable summaries of the time series and automatic unsupervised anomaly detection. We applied AURORA to both synthetic and real-world datasets and demonstrated its superior performance compared to state-of-the-art baselines. AURORA achieved \(AUC = 0.95\) for anomaly detection even in very high noise regimes, and \(100\%\) accuracy for period learning. In addition, AURORA is orders of magnitude faster than the baselines.