1 Introduction

Traditional data analysis techniques deal with feature vectors having deterministic values; thus, data uncertainty is usually ignored in the problem formulation. However, uncertainty arises in real data in many ways, since the data may contain errors or may be only partially complete (Lindley 2006). The uncertainty may result from the limitations of the equipment: physical devices are often imprecise due to measurement errors. Another source of uncertainty is repeated measurements, e.g. sea surface temperature could be recorded multiple times during a day. Also, in some applications data values are continuously changing, such as positions of devices or observations associated with natural phenomena, and these quantities can be represented by using an uncertain model.

Simply disregarding uncertainty may lead to less accurate, or even incorrect, conclusions. This has created a need for uncertain data management techniques (Aggarwal and Yu 2009) managing data records typically represented by probability distributions (Mohri 2003; Kriegel and Pfeifle 2005; Bi and Zhang 2004; Aggarwal and Yu 2008; Angiulli and Fassetti 2012, 2007; Aggarwal 2014; Khan et al. 2018). In this work it is assumed that an uncertain object is an object that always exists but whose actual value is uncertain and modeled by a multivariate probability density function. This notion of uncertain object has been extensively adopted in the literature and corresponds to the attribute-level uncertainty model viewpoint (Green and Tannen 2006).

In particular, we deal with the problem of detecting outliers in uncertain data. An outlier is an observation that differs so much from others as to arouse suspicion that it was generated by a different mechanism (Hawkins 1980). As a major contribution, we introduce a definition of uncertain outlier representing the generalization of the classic distance-based outlier definition (Knorr et al. 2000; Ramaswamy et al. 2000; Angiulli and Pizzuti 2005; Angiulli et al. 2006) to the management of uncertain data modeled as arbitrary probability density functions. The distance-based definition is a solid one: it has been introduced in order to overcome some limitations of statistical definitions, generalizes the notion of outlier provided by several discordance tests developed in statistics, is suitable for multivariate data, and can be applied even if the distribution of the data is unknown. The contributions of the work are summarized next.

  • To the best of our knowledge, this is the first unsupervised outlier detection technique working on data objects modeled by means of arbitrarily shaped multidimensional distribution functions.

  • We introduce a novel definition of uncertain outlier representing the generalization of the classic distance-based outlier definition (Knorr et al. 2000; Ramaswamy et al. 2000; Angiulli and Pizzuti 2005) to the management of uncertain data modeled as pdfs.

  • Our approach consists in declaring an object an outlier if the probability that it has at least k close neighbors is low. Hence, it corresponds to performing a nearest neighbor density estimate over all the possible dataset outcomes. As such, its semantics is completely different from that of previously introduced unsupervised approaches for outlier detection on uncertain data (Aggarwal and Yu 2008; Wang et al. 2009; Jiang and Pei 2011).

  • We show how the decision rule associated with the definition introduced here, although difficult to compute, can be faithfully implemented.

  • We provide an efficient uncertain distance-based outlier detection algorithm working on any domain and with any distance function.

The rest of the paper is organized as follows. Section 2 introduces the notion of distance-based uncertain outlier. Section 3 discusses work related to the one here presented. Section 4 shows how to compute the outlier probability. Section 5 presents the outlier detection method. Section 6 illustrates experimental results. Finally, Section 7 concludes the work.

2 Preliminaries

2.1 Uncertain objects

Let \((\mathbb {D},\text {dist})\) denote a metric space, where \(\mathbb {D}\) is a set, also called domain, and dist is a metric distance on \(\mathbb {D}\) (e.g., \(\mathbb {D}\) is the d-dimensional real space \(\mathbb {R}^{d}\) equipped with the Euclidean distance dist).

A certain object v is an element of \(\mathbb {D}\). An uncertain object x is a random variable having domain \(\mathbb {D}\) with associated probability density function fx, where fx(v) denotes the density of x in v. We note that a certain object v can be regarded as an uncertain one whose associated pdf fv is δv(u) = δ(u − v), with δ denoting the Dirac delta function, so that fv is nonzero only for u = v.

Given a set S = {x1,…,xN} of uncertain objects, an outcome IS of S is a set {v1,…,vN} of certain objects such that \(f^{x_{i}}(v_{i})>0\) (1 ≤ i ≤ N). The pdf fS associated with S is

$$ f^{S}(v_{1},\ldots,v_{N}) = \prod\limits_{i=1}^{N} f^{x_{i}}(v_{i}). $$

Given two uncertain objects x and y, dist(x,y) denotes the continuous random variable representing the distance between x and y.

In the following we assume that with each object x a finite region SUP(x) is given, such that Pr(x∉SUP(x)) ≤ ω for a specified threshold ω. For example, SUP could be defined as a hyper-ball or a hyper-rectangle (e.g. the minimum bounding rectangle, or MBR). If x has finite support, the threshold ω can always be set to 0. Note that under the above assumption the error involved in the calculation of the probability Pr(dist(x,y) ≤ R), with x and y two uncertain objects, is the square of ω.

The minimum distance mindist(x,y) between uncertain objects x and y is defined as \({\min \limits } \{ \text {dist}(u,v) : u\in \text {SUP}(x) \text { and } v\in \text {SUP}(y) \}\), while the maximum distance maxdist(x,y) between x and y is defined as \({\max \limits } \{ \text {dist}(u,v) : u\in \text {SUP}(x) \text { and } v\in \text {SUP}(y) \}\).
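
For concreteness, when the SUP regions are axis-aligned hyper-rectangles (MBRs) and dist is the Euclidean distance, mindist and maxdist admit a simple closed form. The following sketch is only illustrative (the function names are ours, not part of the method description):

```python
import numpy as np

def mindist_mbr(a_lo, a_hi, b_lo, b_hi):
    """Minimum Euclidean distance between two axis-aligned MBRs.

    Each MBR is given by its lower and upper corners (1-d numpy arrays).
    Along every coordinate the gap is zero whenever the intervals overlap.
    """
    gap = np.maximum(0.0, np.maximum(a_lo - b_hi, b_lo - a_hi))
    return np.linalg.norm(gap)

def maxdist_mbr(a_lo, a_hi, b_lo, b_hi):
    """Maximum Euclidean distance between two axis-aligned MBRs.

    Along every coordinate the farthest pair of points lies on opposite
    extreme faces, so the per-coordinate spread is the larger of the two
    corner-to-corner differences.
    """
    spread = np.maximum(np.abs(a_hi - b_lo), np.abs(b_hi - a_lo))
    return np.linalg.norm(spread)

# Example: two 2-d MBRs, overlapping along the second coordinate only
a_lo, a_hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])
b_lo, b_hi = np.array([2.0, 0.5]), np.array([3.0, 1.5])
print(mindist_mbr(a_lo, a_hi, b_lo, b_hi))   # 1.0
print(maxdist_mbr(a_lo, a_hi, b_lo, b_hi))   # sqrt(9 + 2.25), opposite corners
```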

2.2 Uncertain outliers

Given an uncertain dataset DS, Dk(x,DS) (or Dk(x), for short) denotes the continuous random variable representing the distance between x and its k-th nearest neighbor in DS ∖{x}. Next we define the notion of outlier in an uncertain dataset. For the sake of brevity, in the sequel we will refer to an outlier in an uncertain dataset as an uncertain outlier.

Definition 1

Given an uncertain dataset DS, an uncertain distance-based outlier in DS according to parameters k, R and δ ∈ (0,1) is an uncertain object x of DS such that the following relationship holds:

$$ Pr(D_{k}(x,{\textbf{DS}})\le R) \le 1 - \delta. $$

That is to say, an uncertain distance-based outlier is a dataset object for which the probability of having at least k dataset objects, besides itself, within distance R is not greater than 1 − δ.

Let N be the number of objects in DS. In order to determine the probability Pr(Dk(x) ≤ R), the following multi-dimensional integral has to be computed, where \({\textbf {DS}}^{\prime }\) denotes the uncertain dataset DS ∖{x} and \(I_{{\textbf {DS}}^{\prime }}\) a generic outcome of \({\textbf {DS}}^{\prime }\) (see also Section 2.1):

$$ {\int}_{\mathbb{D}^{N}} f^{x}(v) \cdot f^{{\textbf{DS}}^{\prime}}(I_{{\textbf{DS}}^{\prime}}) \cdot \textbf{I}[D_{k}(v,I_{{\textbf{DS}}^{\prime}}) \le R] \ \mathrm{d} I_{{\textbf{DS}}^{\prime}} \ \mathrm{d} v, $$
(1)

where I[⋅] denotes the indicator function, which outputs 1 if its argument is true, and 0 otherwise. According to the above formulation, deciding whether an object is an uncertain distance-based outlier requires computing an integral involving all the outcomes of the dataset.
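
For intuition only, this decision rule can be approximated by brute-force Monte Carlo: sample whole dataset outcomes and count those in which the outcome of x has at least k neighbors within R. The sketch below assumes a hypothetical sample() method returning one outcome of an uncertain object; the algorithm of Section 5 is designed precisely to avoid this kind of brute-force computation.

```python
import numpy as np

def naive_outlier_probability(x, others, k, R, trials=1000, rng=None):
    """Monte Carlo estimate of Pr(D_k(x) <= R) by sampling whole dataset outcomes.

    `x` and every element of `others` (i.e. DS \ {x}) are assumed to expose a
    sample(rng) method returning one outcome as a numpy array (hypothetical
    interface, for illustration only).
    """
    rng = np.random.default_rng() if rng is None else rng
    hits = 0
    for _ in range(trials):
        v = x.sample(rng)                              # outcome of x
        outcome = np.array([y.sample(rng) for y in others])
        dists = np.linalg.norm(outcome - v, axis=1)    # distances within this outcome
        if np.sum(dists <= R) >= k:                    # indicator I[D_k(v, I_DS') <= R]
            hits += 1
    return hits / trials                               # approximates integral (1)

# x is flagged as an uncertain outlier when the estimate is <= 1 - delta.
```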

3 Related work

There exist several approaches to detect outliers in the certain setting, namely statistical-based (Davies and Gather 1993; Barnett and Lewis 1994), deviation-based (Arning et al. 1996), distance-based (Knorr et al. 2000), density-based (Breunig et al. 2000; Papadimitriou et al. 2003), reverse nearest-neighbor-based (Angiulli 2020), isolation-based (Liu et al. 2012), subspace-based (Knorr and Ng 1999; Aggarwal and Yu 2001a; Angiulli et al. 2009, 2013), knowledge-based (Angiulli and Fassetti 2014), neural network-based (Hawkins et al. 2002), support vector machine-based (Tax and Duin 2004), and many others (Chandola et al. 2009; Aggarwal 2016). Among these approaches, distance-based outlier detection methods have been shown to be effective in various scenarios (Knorr et al. 2000; Bay and Schwabacher 2003; Ghoting et al. 2006; Tao et al. 2006; Angiulli and Fassetti 2009). However, none of these techniques is designed to handle uncertain data and, as far as the uncertain setting is concerned, only a few approaches have been proposed (Aggarwal and Yu 2008; Wang et al. 2009; Jiang and Pei 2011).

The method described in Aggarwal and Yu (2008) is a density-based approach designed for uncertain objects which aims at selecting outliers in subspaces. The idea of the method is to approximate the density of the dataset by means of kernel density estimation and then to declare an uncertain object an outlier if there exists a subspace such that the probability that the object lies in a sufficiently dense region is negligible. Differently from our approach, in Aggarwal and Yu (2008) the density estimate does not directly take into account the form of the pdfs associated with the uncertain objects, since it is performed by using equi-bandwidth Gaussian kernels centered in the means of the object distributions. Pdfs are then taken into account to determine the objects lying in regions of low density, where the density is computed as mentioned above. Furthermore, since the method is interested in exploring subspaces (we recall that our goal is to detect outliers in the full feature space), pdfs are always expressed as the product of d independent one-dimensional pdfs, where d is the dimensionality of the space, while we are able to manage arbitrarily shaped multidimensional density functions.

In Wang et al. (2009), the authors present a distance-based approach to detect outliers which adopts a completely different model of uncertainty than ours, namely the existential uncertainty model, according to which an uncertain object x assumes a specific value vx with a fixed probability px and does not exist with probability 1 − px. According to this approach, uncertain objects are not modeled by means of distribution functions, but rather are deterministic values that may either occur or not occur in an outcome of the dataset. Hence, although Wang et al. (2009) deals with distance-based outliers, their scenario is completely different from ours, and the two methods are not comparable at all.

In Jiang and Pei (2011), an uncertain object consists of a pair (l,r), where l is a tuple on a set of conditioning attributes and r is a set of tuples on a set of dependent attributes, also called instances. Each instance rj ∈ r is assigned a measure of normality, namely the probability of observing rj given that both r and l have been observed. The normality of an object is then obtained as the geometric mean of the normality of all its instances. The authors exploit kernel density estimation and Bayesian inference to solve their problem. Outlier instances are detected by comparison against normal ones. Outlier objects are then detected as those objects most of whose instances are abnormal. We notice that the approach presented in Jiang and Pei (2011) essentially aims at detecting the abnormal instances, that, loosely speaking, are the abnormal outcomes of the uncertain objects. Thus, the task of interest in Jiang and Pei (2011) is not comparable to the one considered here. Moreover, uncertain objects are modeled in a way which is completely different from that considered here.

The work (Liu et al. 2013) describes an SVDD-based outlier detection technique for uncertain data. The approach assigns a confidence score to each example, indicating the likelihood of the example belonging to the normal class, and then incorporates these confidence scores into the SVDD training phase for outlier detection. Hence, the technique does not directly manage uncertain objects, but rather attempts to mitigate possible measurement errors by reducing the contribution to the construction of the decision boundary of the examples with the lowest confidence scores.

4 Outlier probability

In this section we show how the value of Pr(Dk(x) ≤ R) can be computed, for x a generic uncertain object of DS. Given a certain object v and an uncertain object y, let \({p_{v}^{y}}(R) = Pr(\text {dist}(v,y) \leq R)\) denote the cumulative distribution function representing the probability that the distance between the objects v and y assumes a value less than or equal to R, that is

$$ {p_{v}^{y}}(R) = Pr(\text{dist}(v,y) \le R) = {\int}_{{\mathcal B}_{R}(v)} f^{y}(u) \ \mathrm{d}u, $$
(2)

where \({\mathcal B}_{R}(v)\) denotes the hyper-ball having radius R and centered in v.

Let v be an outcome of the uncertain object x. For k ≥ 1, the probability Pr(Dk(v,DS ∖{x}) ≤ R) that v has at least k other dataset objects within distance R can be expressed as:

$$ 1 - \left( \underset{S\subseteq {\textbf{DS}}\setminus\{x\}\,:\,|S|< k}{\sum} \left( \underset{z\in S}{\prod} {p_{v}^{z}}(R) \cdot \underset{z\in ({\textbf{DS}}\setminus\{x\})\setminus S}{\prod} (1 - {p_{v}^{z}}(R)) \right) \right), $$
(3)

that is, one minus the probability that fewer than k objects of DS ∖{x} lie within distance R from v. Thus,

$$ Pr(D_{k}(x)\le R) = {\int}_{\mathbb{D}} f^{x}(v) \cdot Pr(D_{k}(v,{\textbf{DS}}\setminus\{x\}) \le R) \ \mathrm{d}v, $$
(4)

that is to say, loosely speaking, the integral over all the outcomes v of x of the occurrence density fx(v) of v multiplied by the probability that v has at least k objects within distance R over all the outcomes of the remaining dataset objects.
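
For small values of k, probability (3) can be evaluated by direct enumeration of the subsets S. The following sketch is illustrative only: it assumes the values \({p_{v}^{z}}(R)\) are already available, and the incremental procedure of Section 5.3 performs the same computation far more efficiently.

```python
from itertools import combinations
import numpy as np

def prob_at_least_k_within_R(p, k):
    """Evaluate (3): probability that at least k of the independent events
    "dist(v, z) <= R" occur, given p[i] = p_v^{z_i}(R) for the objects of DS \ {x}.

    Direct enumeration of the subsets S with |S| < k; feasible only for small k.
    """
    n = len(p)
    p = np.asarray(p, dtype=float)
    prob_fewer_than_k = 0.0
    for size in range(k):
        for S in combinations(range(n), size):
            in_S = np.zeros(n, dtype=bool)
            in_S[list(S)] = True
            prob_fewer_than_k += np.prod(np.where(in_S, p, 1.0 - p))
    return 1.0 - prob_fewer_than_k

# Example: three neighbors with p_v^z(R) = 0.9, 0.5, 0.2 and k = 2
print(prob_at_least_k_within_R([0.9, 0.5, 0.2], k=2))   # 0.55
```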

The subsequent section describes the algorithm UDBOD, whose aim is to quickly detect the dataset objects for which the right hand side of (4) is smaller than the provided probability threshold 1 − δ. Next it is discussed how to compute \({p_{v}^{y}}(R)\).

4.1 Computing the probability \({p_{v}^{y}}(R)\)

The probability values \({p_{v}^{y}}(R)\) depend on the objects v and y and on the real value R, and involve the computation of one integral with domain of integration \(\mathbb {D}\) (more precisely, the hyper-ball \({\mathcal B}_{R}(v)\)). It is known (Lepage 1978) that, given a function g, if m points w1, w2, …, wm are randomly selected according to a given pdf f, then the following approximation holds:

$$ \int g(u) \mathrm{d}u \approx \frac{1}{m} \sum\limits_{i=1}^{m} \frac{g(w_{i})}{f(w_{i})}. $$
(5)

Thus, in order to compute the value \({p_{v}^{y}}(R)\) reported in (2), the function \({g_{v}^{y}}(u)\) such that \({g_{v}^{y}}(u) = f^{y}(u)\) if dist(v,u) ≤ R, and \({g_{v}^{y}}(u) = 0\) otherwise, can be integrated by evaluating the formula in (5) with m points wi randomly selected according to the pdf fy. This procedure reduces to computing the relative number of sample points wi lying at distance not greater than R from v, that is

$$ {p_{v}^{y}}(R) = \frac{|\{ w_{i} : \text{dist}(v,w_{i}) \le R \}|}{m}. $$
(6)
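
In code, the estimate (6) amounts to the following minimal sketch, where sample_y is assumed to be an array of m points drawn according to fy (how they are drawn depends on the distribution at hand):

```python
import numpy as np

def p_v_y(v, sample_y, R):
    """Estimate p_v^y(R) as in (6): fraction of the m sample points of y
    falling within distance R of the certain object v."""
    dists = np.linalg.norm(sample_y - v, axis=1)
    return np.mean(dists <= R)

# Example: y is a 2-d standard Gaussian centered in the origin
rng = np.random.default_rng(0)
sample_y = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
print(p_v_y(np.array([0.0, 0.0]), sample_y, R=1.0))   # ~0.39 = 1 - exp(-1/2)
```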

5 Uncertain distance-based outlier detector

In this section we describe the algorithm UDBOD (for Uncertain Distance-Based Outlier Detector) that mines the distance-based outliers in an uncertain dataset DS consisting of N objects.

Definition 1 makes use of three parameters: k (or, equivalently, ϱ ∈ (0,1), by setting k = ϱN), R, and δ. We point out that these parameters can be held fixed to default values in order to perform a meaningful analysis, as the experimental results show that outlier detection is little sensitive to the values of the user-specified parameters. Specifically, according to the statistical and distance-based outlier detection literature (Knorr et al. 2000; Angiulli and Fassetti 2009), meaningful values for the parameter ϱ are in the range (0,2‰] (the value 1‰ is employed by default), while δ, being a threshold level, can be conveniently set in the range [0.8,0.9] (the value 0.9 is employed by default). As for the value of the parameter R, it is automatically determined by UDBOD once the percentage α of outliers to detect has been specified. The value α is much easier to determine than R and can be conveniently set to 3‰ (Angiulli and Fassetti 2009).

Besides the above external parameters, the method requires some internal parameters, described in the remainder of this section, that do not need to be set by the user, since their optimal values are automatically determined from the external ones. Table 1 summarizes some of the symbols employed in this section, together with meaningful ranges and recommended values for the parameters.

Table 1 Symbols employed in Section 5.1

The pseudo-code of UDBOD is reported in Algorithm 1. It consists of three phases: parameter estimation, candidate selection, and candidate filtering.

Algorithm 1 Pseudo-code of UDBOD

5.1 Parameter estimation phase

The Parameter estimation phase determines a suitable value for the outlier radius R as a function of ϱ and α. Note that the effectiveness of the uncertain distance-based definition relies on a proper selection of the radius value. Setting a meaningful value for the parameter R is a difficult task, since its right value heavily depends on the characteristics of the input data. In particular, we map the problem of setting R to the problem of setting a parameter β ∈ [0,1], by means of which the expected fraction of outliers can be controlled in a very simple and meaningful way. Indeed, as made clearer next, for β = 0 (β = 1, resp.) we have the statistical guarantee that a subset (superset, resp.) of the actual outliers is retrieved. In order to provide the above statistical guarantees, the meaningfulness of the outlier radius is related to the number of outliers estimated by means of a sampling procedure. The following definition is preliminarily needed.

Definition 2

Given two uncertain objects x and y, and a value β ∈ [0,1], also called mass factor, let distβ(x,y) denote the distance value:

$$ \text{dist}^{\beta}(x,y) = \beta\cdot maxdist(x,y) + (1-\beta)\cdot mindist(x,y). $$

Note that for β = 1 the distance distβ(x,y) coincides with maxdist(x,y), that for β = 0 the distance distβ(x,y) coincides with mindist(x,y), while for β ∈ (0,1), distβ(x,y) assumes an intermediate value.

Let \(D_{k}^{\beta }(x,{\textbf {DS}})\) (or \(D_{k}^{\beta }(x)\), whenever the dataset DS is clear from the context) denote the k-th nearest neighbor distance in DS ∖{x} according to distβ.

Let α denote the percentage of outliers to be detected. Then, once the parameter k = ⌈ϱN⌉ has been fixed, the value of the radius R such that α percent of the dataset objects have fewer than k objects at distance distβ less than R can be estimated by means of the method reported in Algorithm 2.

Algorithm 2 Radius estimation procedure
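
The following sketch conveys the idea behind the sampling-based estimate; it is not Algorithm 2 verbatim (in particular, it omits the correction discussed at the end of this subsection), and it assumes that the (s × s) matrices of mindist and maxdist values over the sample have already been computed:

```python
import numpy as np

def estimate_radius(mindist_s, maxdist_s, beta, rho, alpha):
    """Estimate the outlier radius R on a sample of s uncertain objects.

    mindist_s, maxdist_s : (s, s) matrices of mindist / maxdist values
                           between the sampled objects.
    The returned radius is such that about an alpha fraction of the sampled
    objects have fewer than ceil(rho * s) neighbors within dist^beta
    distance R.
    """
    s = mindist_s.shape[0]
    k = int(np.ceil(rho * s))
    dist_beta = beta * maxdist_s + (1.0 - beta) * mindist_s
    np.fill_diagonal(dist_beta, np.inf)            # exclude the object itself
    # D^beta_k for every sampled object: k-th smallest distance to the others
    dk = np.sort(dist_beta, axis=1)[:, k - 1]
    # R is (approximately) the (1 - alpha)-quantile of the D^beta_k values
    return np.quantile(dk, 1.0 - alpha)
```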

For the above method to be effective, a meaningful value for the sample size s must be employed. We now show how to set the size s of the sample so as to have a statistical guarantee that the actual percentage \(\widehat {\alpha }\) of objects in the whole dataset DS having \(D^{\beta }_{k}\) greater than R is close to α.

With this aim, the following relation must hold

$$ {Pr}(|\widehat{\alpha}-\alpha|\le\epsilon)>1-\lambda, $$
(7)

asserting that the probability that the estimation error, that is the difference between \(\widehat {\alpha }\) and α, is lower than an error threshold 𝜖, is greater than 1 − λ. Clearly, the lower 𝜖 and λ, the closer \(\widehat {\alpha }\) is to α. By the Central Limit Theorem, if the sample size s is large enough, then the following relationship holds:

$$ Pr\left( |\widehat{\alpha}-\alpha|\le\epsilon\right)\approx 2 \cdot {\Phi}\left( \frac{\epsilon\sqrt{s}}{\sqrt{\alpha(1-\alpha)}}\right) - 1. $$

Hence, the relation in (7) is satisfied if

$$ s>\frac{\alpha(1-\alpha)}{\epsilon^{2}}\left( {\Phi}^{-1}\left( 1-\frac{\lambda}{2}\right)\right)^{2}. $$
(8)
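
The bound in (8) is straightforward to evaluate numerically; a minimal sketch using the standard normal quantile function, reproducing the example below:

```python
from statistics import NormalDist

def min_sample_size(alpha, eps, lam):
    """Right hand side of the bound in (8); any integer s above it works."""
    z = NormalDist().inv_cdf(1.0 - lam / 2.0)   # Phi^{-1}(1 - lambda / 2)
    return alpha * (1.0 - alpha) / eps ** 2 * z ** 2

# alpha = 3 per mille, eps = 2 per mille, lambda = 0.2
print(min_sample_size(0.003, 0.002, 0.2))       # ~1228, as in the example below
```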

For example, let α = 3‰, 𝜖 = 2‰ and λ = 0.2, so that the number of uncertain outliers is between 1‰ and 5‰ with probability 0.8. By using (8), the sample size is s = 1,228. Now, we prove that the radius R returned by the Parameter estimation phase is meaningful for Definition 1. First, some properties of \(D^{\beta }_{k}\) are introduced.

Property 1

Let x be an uncertain object for which \({D^{1}_{k}}(x)\) is less than or equal to R. Then x is not an outlier.

Indeed, if the condition of the statement is true, then each outcome of x has at least k neighbors within radius R in every outcome of the dataset.

Property 2

Let x be an uncertain object for which \({D^{0}_{k}}(x)\) is greater than R. Then x is an outlier.

Indeed, if the condition of the statement is true, then each outcome of x has fewer than k neighbors within radius R in every outcome of the dataset. Thus, given a radius \(R^{\prime }\), it follows from Property 1 that the uncertain objects x of DS satisfying \({D^{1}_{k}}(x) > R^{\prime }\) are a superset of the outliers in DS for \(R=R^{\prime }\). Moreover, it follows from Property 2 that the uncertain objects x of DS satisfying \({D^{0}_{k}}(x)>R^{\prime }\) are a subset of the outliers in DS for \(R=R^{\prime }\).

Theorem 1

Let R1 (R0, resp.) be the smallest radius such that exactly αN dataset objects x satisfy the condition \({D^{1}_{k}}(x) > R_{1}\) (\({D^{0}_{k}}(x) > R_{0}\), resp.), and let n1 (n0, resp.) denote the actual number of uncertain distance-based outliers in DS for R = R1 (R = R0, resp.). Then, the expected number n = αN of outliers in DS is lower bounded by n1 and upper bounded by n0, that is, n1 ≤ n ≤ n0.

Proof

As already pointed out, the αN objects x satisfying the condition \({D^{1}_{k}}(x) > R_{1}\) are a superset of the n1 actual outliers for R = R1 and, consequently, n1 ≤ αN = n. Moreover, the αN objects x satisfying the condition \({D^{0}_{k}}(x) > R_{0}\) are a subset of the n0 actual outliers for R = R0 and, consequently, n0 ≥ αN = n. □

This clarifies the motivation underlying the introduction of the parameter β: by properly tuning the value of β, the actual number of outliers (and also of candidate outliers; see in the following) can be controlled in a very simple way. As for the value to assign to β, in the section devoted to experimental results it will be shown that β = 0.5 is a good option.

If at the expected outlier level α there is no clear separation between the radius associated with outliers and that associated with inliers, then it can be concluded that there are fewer than αN true outliers in the dataset. So, in this case the fraction α should be lowered, for otherwise a considerable fraction of dataset objects would be recognized as outliers. This can be accomplished by properly lowering the radius R. In particular, Algorithm 2 guarantees that the computed radius R is at least four standard deviations away from the mean of the distribution of distances between sampled objects and their ⌈ϱs⌉-th nearest neighbor in the sample. Specifically, this estimation correction selects the smallest radius associated with objects in the sample which is not smaller than the above mentioned threshold.

5.2 Candidate selection phase

The Candidate selection phase quickly determines the set OutCands of candidate outliers by exploiting a deterministic lower bound property based on the maxdist distance between uncertain objects. We start by recalling the definition of a distance-based outlier in the context of certain datasets (Knorr et al. 2000).

Definition 3

Given a dataset DS of objects on which a distance dist is defined, a positive integer k, and a positive real number R, an object v is said to be a (certain) distance-based outlier according to parameters k and R if fewer than k objects of DS lie within distance R from v.

The following result establishes the link between certain and uncertain distance-based outliers.

Theorem 2

For each δ, if x is an uncertain distance-based outlier of DS according to parameters k, R and δ, then x is a certain distance-based outlier of DS for the distance maxdist according to parameters k and R.

Proof

We prove that if x is not a certain distance-based outlier of DS then x is not an uncertain distance-based outlier of DS. First, we notice that x is not a certain distance-based outlier according to parameters k and R if and only if the distance to its k-th nearest neighbor is not greater than R. Moreover, we recall that \({D_{k}^{1}}(x)\) denotes the distance between x and its k-th nearest neighbor according to the distance maxdist. The proof then follows by Property 1. □

From the above theorem a suitable set OutCands of uncertain candidate outliers can be obtained by regarding DS as a set of certain objects equipped with the certain distance maxdist and by computing the certain distance-based outliers therein contained.

As an important property, it is next shown that if the employed distance function dist is a metric, then the maximum distance function maxdist induced by dist is a metric as well.

Theorem 3

Let dist be a metric. Then the maxdist function induced by the distance dist is a metric.

Proof

Four properties have to be proven: non-negativity, symmetry, identity of indiscernibles, and triangle inequality. The first two properties immediately follow from the fact that dist is a metric.

As for the identity of indiscernibles, assume that maxdist(x,y) = 0; then, for each pair of realizations u of x and v of y occurring with nonzero probability, dist(u,v) = 0. Hence, by the fact that dist is a metric, u = v, and x and y must be the same uncertain object. As for the reverse direction, since x and y are the same random variable, u and v are always identical.

As for the triangle inequality, given three generic uncertain objects x, y, and z, it has to be shown that maxdist(x,z) + maxdist(z,y) ≥ maxdist(x,y). Let x1 and z1 (y2 and z2, resp.) denote the outcomes of the uncertain objects x and z (y and z, resp.) for which the relationship dist(x1,z1) = maxdist(x,z) (dist(z2,y2) = maxdist(z,y), resp.) is satisfied.

Let x0 and y0 be the outcomes of the uncertain objects x and y for which dist(x0,y0) = maxdist(x,y) holds. Assume that maxdist(x,y) > maxdist(x,z) + maxdist(z,y). Given an arbitrary outcome z0 of z, since dist is a metric by assumption, by the triangle inequality it holds that dist(x0,z0) + dist(z0,y0) ≥dist(x0,y0) = maxdist(x,y), and, by the above assumption, it finally holds that dist(x0,z0) + dist(z0,y0) > dist(x1,z1) + dist(z2,y2). But, this would contradict the definition of x1, z1, z2 and y2, since, by definition of maxdist, it is the case that dist(x1,z1) ≥dist(x0,z0) and dist(z2,y2) ≥dist(z0,y0) and, hence, that dist(x1,z1) + dist(z2,y2) ≥dist(x0,z0) + dist(z0,y0). Hence, the statement follows. □

Thus, even if the space obtained by using maxdist as a distance function is not Euclidean, it is anyway a metric one provided that dist is itself a metric (as it is the case when dist is the Euclidean distance). The above result has the important practical implication that the set OutCands can be determined by exploiting certain distance-based outlier detection algorithms designed to work in general metric spaces.

As a consequence, in step 2 the algorithm UDBOD employs the DOLPHIN technique (Angiulli and Fassetti 2009). DOLPHIN performs two sequential scans of the dataset. During the first scan, a superset of the true outliers is detected by accumulating, in a data structure called INDEX, the incoming objects that cannot be recognized as inliers by exploiting the objects already stored in INDEX. The second scan is needed to recognize the true outliers in INDEX. The temporal cost analysis relies on proving that the size of INDEX is \(O(\frac {k}{p})\).

Algorithm 3 Candidate filtering procedure

5.3 Candidate filtering phase

The Candidate filtering phase (see steps 3-9 in Algorithm 1) computes the set Outliers of uncertain outliers contained in the dataset by processing the objects in the set OutCands. In order to reduce the computational effort, a lower bound property is introduced and exploited, which avoids considering all the potential neighbors of the candidate outliers when computing their outlier probability.

The objects x of OutCands such that \(D^0_k(x)> R\) can be safely inserted into Outliers since, by Property 2, they are outliers for sure. We call these objects ready outliers.

As for the non-ready candidates x, it has to be decided whether Pr(Dk(x) ≤ R) ≤ 1 − δ or not, and this is accomplished by computing (4) by means of the procedure explained in the rest of this section and reported in Algorithm 3 (see lines 6-26). With this aim, consider the set DSx,R = {y ∈ DS : mindist(x,y) ≤ R}, also called the neighbor list of x (in DS w.r.t. R).

The objects in the set DSx,R are the only uncertain objects of DS which contribute to the probability Pr(Dk(x) ≤ R), since for the objects z ∈ DS ∖ DSx,R it holds that Pr(dist(x,z) ≤ R) = 0.

Let w1,…,wm denote m outcomes of x, and let y1,…,yℓ denote the ℓ = |DSx,R| uncertain objects in the set DSx,R, ordered according to an arbitrary criterion.

Let P(wh,i,j) denote the probability that the certain object wh has exactly i neighbors among the first j uncertain objects y1,…,yj of DSx,R.

Moreover, let \(P_{k}^j(x)\) denote the probability that x has at least k neighbors within distance R among the first j uncertain objects of DSx,R, then by exploiting the approximation in (5):

$$ {P_{k}^{j}}(x) = \frac{1}{m}\sum\limits_{h=1}^{m} \left( 1 - \sum\limits_{i=0}^{k-1} P(w_{h},i,j) \right) $$
(9)

The following theorem holds.

Theorem 4

If there exists j ≥ k such that \(P_k^j(x)> 1 - \delta \), then x is not an outlier.

Proof

The proof follows by noticing that, for each j ∈ {1,2,…,|DSx,R|}, it holds that \(P_k^j(x) \le Pr(D_k(x)\le R),\) that is to say, \(P_k^j(x)\) is a lower bound for the probability that x has at least k neighbors within distance R in a generic outcome of DS. □

Consequently, if for some j ≥ k the left hand side term above exceeds 1 − δ, then the computation can be stopped early, reporting that x is not an outlier.

Notice that \(P^{\ell }_k(x)\) is precisely Pr(Dk(x) ≤ R). Interestingly, in order to compute \(P^{\ell }_k(x)\) and its lower bounds \(P^j_k(x)\) (1 ≤ j ≤ ℓ), only space O(mk) is needed instead of O(mkℓ), since the mkℓ terms P(wh,i,j) can be computed by means of the incremental procedure described next. Let pj be Pr(dist(wh,yj) ≤ R); then the following relationship is satisfied:

$$P(w_{h},i,j) = p_{j}\cdot P(w_{h},i-1,j-1) + (1-p_{j})\cdot P(w_{h},i,j-1),$$

that is to say, the probability that the certain object wh has exactly i neighbors among the first j uncertain objects y1,…,yj is equal to (i) the probability pj that yj is a neighbor of wh times the probability that wh has exactly i − 1 neighbors among the uncertain objects y1,…,yj−1, plus (ii) the probability 1 − pj that yj is not a neighbor of wh times the probability that wh has exactly i neighbors among the uncertain objects y1,…,yj−1. By the above relationship it is clear that, in order to compute the terms P(wh,⋅,j), only the terms P(wh,⋅,j − 1) are needed, which are k terms for each of the m outcomes wh of x.
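
A minimal sketch of this incremental computation, combined with the early stop of Theorem 4, is reported below. It is illustrative only (not the paper's Algorithm 3); the entry p[h, j] is assumed to hold the value Pr(dist(wh,yj+1) ≤ R), estimated for instance via (6).

```python
import numpy as np

def filter_candidate(p, k, delta):
    """Decide whether a candidate x is an uncertain outlier.

    p : (m, ell) array, p[h, j] = Pr(dist(w_h, y_{j+1}) <= R) for the m
        outcomes w_h of x and the ell objects in its neighbor list.
    Returns False (x is an inlier) as soon as the lower bound P_k^j(x)
    exceeds 1 - delta (Theorem 4); otherwise returns True (x is an outlier).
    """
    m, ell = p.shape
    # P[h, i] holds P(w_h, i, j): prob. of exactly i neighbors among y_1..y_j
    P = np.zeros((m, k))
    P[:, 0] = 1.0                                     # j = 0: surely zero neighbors
    for j in range(ell):
        pj = p[:, j][:, None]                         # Pr(dist(w_h, y_{j+1}) <= R)
        shifted = np.hstack([np.zeros((m, 1)), P[:, :-1]])
        P = pj * shifted + (1.0 - pj) * P             # recurrence on P(w_h, i, j)
        if j + 1 >= k:
            lower_bound = np.mean(1.0 - P.sum(axis=1))   # P_k^j(x) as in (9)
            if lower_bound > 1.0 - delta:
                return False                          # early stop: x is an inlier
    return True                                       # Pr(D_k(x) <= R) <= 1 - delta
```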

The Candidate filtering phase, reported in Algorithm 3, details the procedure to compute the lower bound \(P^j_k(x)\) (there, the variable LB is used to accumulate the value of the lower bound, while the matrix elements P[h,i] are used to store the values P(wh,i,⋅)).

The above procedure does not depend on the order y1,…,yℓ of the objects in DSx,R, but considering first the objects closest to x may help to accelerate the convergence of the lower bound. With this aim, the uncertain objects in the set DSx,R are sorted in ascending order of their score s(yj), defined as:

$$s(y_{j}) = \frac{maxdist(x,y_{j})-R}{maxdist(x,y_{j})-mindist(x,y_{j})},$$

for maxdist(x,yj) > R, and s(yj) = 0 for maxdist(x,yj) ≤ R. The score s(yj) ranges in [0,1].

5.4 Temporal cost

Let d denote the cost of computing the distance dist between two certain objects of \(\mathbb {D}\) and also the distances maxdist and mindist between two uncertain objects of \(\mathbb {D}\). Let c denote the number of outlier candidates, let m denote the number of samples employed to evaluate integrals by means of the formula in (6), and let \(\overline {\ell }\) denote the mean number of elements in the neighbor lists DSx,R employed to compute the outlier probability, for x a candidate outlier.

The parameter estimation phase costs O(s²d), where s ≪ N is the size of the sample employed to estimate R, a size that can be considered fixed. The candidate selection phase costs \(O(\frac {k}{p} N d)\), where p ∈ (0,1] is an intrinsic parameter of the dataset at hand (Angiulli and Fassetti 2009). As for the candidate filtering phase, for each outcome wh of x (1 ≤ h ≤ m) and for each yj ∈ DSx,R (\(1\le j\le \overline {\ell }\)), computing Pr(dist(wh,yj) ≤ R) costs O(md), while obtaining the terms P(wh,⋅,j) costs O(k). Thus, deciding whether Pr(Dk(x) ≤ R) ≤ 1 − δ costs in the worst case \(O(\overline {\ell } m (m d + k))\). As a whole, the candidate filtering phase costs \(O(c \overline {\ell } m (m d + k))\).

Thus, the cost of the algorithm is \(O\big (s^2 d + \frac {k}{p}N d + c\overline {\ell }m(md+k) \big )\). The last phase of the algorithm is potentially the heaviest one, since it involves integral calculations. To be practical, the algorithm must be able to select a number of outlier candidates c close to the number αN of expected outliers (α ∈ [0,1]) and possibly to keep the value of \(\overline {\ell }\) as low as possible.

6 Experimental results

In this section, we describe the experimental results obtained with the UDBOD algorithm. If not otherwise stated, we use the default parameter values reported in Table 1 and m = 1,000. The experiments were conducted on an Intel Xeon 2.33 GHz machine with 4GB of RAM under the GNU/Linux operating system. Each dataset is characterized by a parameter γ, called spread, used to set the degree of uncertainty associated with the dataset objects.

Experiments are organized as follows. Section 6.1 studies the scalability of the method. Section 6.2 studies how parameters influence the number of candidate and ready outliers. Section 6.3 compares the proposed method with the related literature. Finally, Section 6.4 presents two case studies.

6.1 Scalability analysis

We considered a family of synthetic datasets whose elements differ in the number N of uncertain objects and the number D of attributes, generated according to the following strategy. The uncertain objects in each dataset form two normally distributed, well-separated clusters with means \((-10, 0, \dots , 0)\) and \((10, 0, \dots , 0)\), respectively. Moreover, 3‰ of the dataset objects are uniformly distributed in a region lying on the hyper-plane x = 0 (that is to say, their first coordinate is always zero). Each uncertain object is randomly assigned a normal, an exponential or a uniform distribution whose spread is related to the standard deviation of the overall data by means of the parameter γ ∈ {0.02,0.05,0.1}.
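
A simplified sketch of this generation scheme is given below; details such as the cluster covariances and the exact outlier region are our assumptions, and only the object centers and the pdf spread are produced:

```python
import numpy as np

def make_synthetic(N, D, gamma, outlier_frac=0.003, seed=0):
    """Generate object centers for the synthetic family described above.

    Two Gaussian clusters centered at (-10, 0, ..., 0) and (10, 0, ..., 0),
    plus a small fraction of objects uniformly spread on the hyper-plane
    x = 0. Each center is then turned into an uncertain object by attaching
    a pdf whose spread is gamma times the overall data standard deviation.
    """
    rng = np.random.default_rng(seed)
    n_out = int(round(outlier_frac * N))
    n_in = N - n_out
    centers = rng.normal(size=(n_in, D))
    centers[: n_in // 2, 0] -= 10.0               # first cluster
    centers[n_in // 2 :, 0] += 10.0               # second cluster
    outliers = rng.uniform(-10.0, 10.0, size=(n_out, D))
    outliers[:, 0] = 0.0                          # region on the hyper-plane x = 0
    centers = np.vstack([centers, outliers])
    spread = gamma * centers.std()                # pdf spread tied to the overall std
    return centers, spread
```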

Figure 1 (left) shows the scalability with respect to the number N of objects. In this experiment, N has been varied between 10,000 and 1,000,000, while the number of dimensions D has been held fixed to 3. These curves show that the method performs very well for different values of the spread. In particular, the execution time is below 1,000 seconds even for one million objects, confirming that the method is able to manage large datasets.

Fig. 1 Scalability with respect to the dataset size and the number of dimensions for the Synthetic dataset

Figure 1 (right) shows the scalability with respect to the number of dimensions D. This time the number of objects has been held fixed to 10,000. Also in this case, time performances are good. The execution time clearly increases with the dimensionality, due to the increasing cost of evaluating outcomes of the distributions, but in these experiments it remained below 100 seconds even for 10-dimensional datasets.

We also studied accuracy. Figure 2 reports the F-score as a function of the radius R, where the objects lying on the hyper-plane x = 0 are taken as the ground-truth outliers. The curves highlight the accuracy of the approach. Indeed, for radius values above 1.5 the F-score is close to 1 for every considered spread, and for spreads equal to 0.02 and 0.05 the F-score is almost always above 0.9 for all the radii considered.

Fig. 2 Accuracy of the Synthetic dataset family

For the highest spread and the lowest radius considered, the F-score drops. This situation can be understood by considering Table 2, which reports the number of outliers returned by the method. It can be seen that for spread equal to 0.1 and radius set to 1.0, the number of outliers returned by the method is notably larger than the actual number of outliers. All the objects lying on the hyper-plane x = 0 are correctly retrieved, but the method starts to consider as outliers also the objects lying in the tails of the distributions associated with the clusters.

Table 2 Outliers detected for the Synthetic dataset

6.2 Sensitivity analysis

In this section, we study how the parameters influence performance, that is, the number of candidate outliers and of ready outliers. We employed the following datasets: Cities (N = 5,922, d = 2), Household (N = 2,075,259, d = 7), Skin (N = 245,057, d = 3), and US Points (N = 15,206, d = 2). Cities, containing 5,922 city and village locations in Greece, and US Points, containing 15,206 populated places in the USA, are from the R-Tree Portal. Household and Skin are from the UCI ML Repository. Household contains 2,075,259 measurements of electric power consumption. Skin was collected by randomly sampling 245,057 RGB values from face images of various age groups, race groups, and genders.

For all datasets, a family of uncertain datasets has been obtained as follows. An uncertain object xi has been associated with each certain object vi in the original dataset, whose pdf \(f^{x_i}(u)\) is a multidimensional distribution, randomly selected among normal, uniform and exponential, centered in vi and whose spread is related to the standard deviation of the overall data by means of the parameter γ. Different values for the parameter β and for the spread γ (specifically, γ ∈ {0.05,0.1}) have been taken into account.

Figure 3 reports the number of candidate outliers detected at the end of the candidate selection phase (gray bar, on the left), the actual number of outliers detected (green bar, in the middle), and the number of non-ready candidates (red bar, on the right). Specifically, the non-ready candidates are the objects for which (4) has to be evaluated. Notice that in almost all of the runs the number of candidate objects represents a small fraction of the overall dataset size, in the worst case amounting to 0.65% (when γ = 0.05) and 1.67% (when γ = 0.10) for Cities, 0.32% for Household, 0.38% (when γ = 0.05) and 0.45% (when γ = 0.10) for Skin, and 1.21% (when γ = 0.05) and 5.97% (when γ = 0.10) for US Points. This confirms that the candidate selection phase saves a vast amount of time.

Fig. 3 Number of candidates (gray bars, on the left), outliers (green bars, in the middle), and non-ready candidates (red bars, on the right). The dashed line represents the number αN of expected outliers

The dashed line represents the number αN, with α = 3‰. The figure makes clear the effect of the parameter β on the efficiency of the method (number of candidates) and on the number of actual outliers. There appears to be a trade-off between these two numbers that can be controlled by means of β. As for the correspondence between the number of actual outliers and the number αN of expected ones, according to Theorem 1 the number of outliers for β = 0 (β = 1, resp.) should be greater (lower, resp.) than the expected αN. Clearly, this is true modulo (i) the error introduced by the radius estimation and (ii) the correction applied to the estimation. Specifically, the above relationship is satisfied for Cities, US Points and Household with γ = 0.05, and for Cities and US Points with γ = 0.10. As for the other cases, the number of actual outliers is always smaller than the expected one, since the correction of the estimation has been employed. This is also confirmed by the fact that the number of actual outliers is almost the same for the different values of β. Thus, in these cases there are fewer than αN true outliers and the parameter estimation phase is able to determine the right radius. This experiment highlights that the parameter estimation phase determines parameter values complying with the required number of outliers without exceeding the number of clearly non-outlying objects.

As for the number of candidates, it is roughly inversely proportional to the value of β. So, in order to reduce the computational effort it is better to employ β values greater than zero. As for values of β close to one, the actual number of outliers could turn out to be considerably smaller than the αN fraction, so it is better to employ β values smaller than one. Intermediate values of β (around 0.5) seem a good trade-off between the number of candidates and the number of actual outliers. Indeed, β = 0 could result in a lot of candidate outliers (e.g., for US Points and γ = 0.1 the number of candidates is more than 16 times greater than the number of outliers), while β = 1 could result in too few outliers (e.g., for Cities and γ = 0.1 the number of outliers is about nine times smaller than the expected one).

Figure 4 shows the size of the neighbor list associated with the rejected candidates (green bar, on the left), namely the non-ready candidates which are inliers, and the number of neighbors considered until the early stop is reached (red bar, on the right). The dashed line represents the value of the parameter k. The figure shows that the candidate filtering phase is able to recognize the inliers without the need to take into account all the objects in the neighbor list (whose average number corresponds to the green bars in Fig. 4). In particular, as witnessed by the red bars (on the right), the number of neighbors actually considered in (9) is close to k. Notice that at least k neighbors have to be considered in order to prove the inlierness of an object. Thus, the candidate filtering phase keeps very low the computational effort to be paid on candidate objects.

Fig. 4 Rejected candidates: size of the neighbor list (green bars, on the left) and number of neighbors considered until early stop (red bars, on the right). The dashed line represents the value of the parameter k

Figure 5 shows the elapsed time at the end of the parameter estimation phase (dotted line), the candidate selection phase (dashed line), and the candidate filtering phase (solid line). The plots confirm that the bulk of the computation is given by the last phase.

Fig. 5 Execution time: elapsed time at the end of the parameter estimation phase (dotted line), at the end of the candidate selection phase (dashed line), and at the end of the candidate filtering phase (solid line)

6.3 Comparison with other methods

We compared UDBOD with the DensitySamp technique introduced in Aggarwal and Yu (2008) and the Deterministic technique introduced in Aggarwal and Yu (2001b). The former is designed for uncertain data and is described in Section 3. The latter does not manage uncertainty, but determines outliers by finding projections of the data which have abnormally low density; it was already used as a baseline competitor in Aggarwal and Yu (2008). In the comparison we employed a family of datasets described in Aggarwal and Yu (2008), whose characteristics are recalled next. The data points were generated by creating Gaussian clusters in the underlying data, whose centers were generated uniformly in the unit data cube. The number of data points in each cluster was proportional to a random variable drawn from a uniform distribution in [0,1]. The radius along each dimension was drawn from a uniform distribution in [0,r]. A fraction p of the data points were designated as outliers. The outliers were generated anywhere in the data cube. A total of N data points were generated in d dimensions. All datasets were normalized, so that the standard deviation along each dimension was 1 unit. Each uncertain attribute is normally distributed with zero mean and standard deviation drawn from a uniform distribution in [0,2 ⋅ f] ⋅ σ, where σ is the standard deviation of that dimension in the underlying data. The dataset is denoted by R(r).O(p).d(d).D(N).U(f).

Since the outliers were known, precision and recall could be measured. In the case of UDBOD, the trade-off between precision and recall is measured by varying the radius R. As for the two other algorithms, we varied their parameters and applied them to the above datasets as described in Aggarwal and Yu (2008). Figure 6 reports the results of the comparison. According to Aggarwal and Yu (2008), we employed the following values for the parameters: r = 0.3, d = 10, p ∈ {0.1, 0.2}, N = 100K, and f ∈ {1.0, 1.5, 2.0, 2.5, 3.0}. Specifically, the two plots on the top report the precision and the recall of the methods for different outlier fractions, namely p = 0.1 and p = 0.2, and uncertainty level f = 1.5, while the two plots on the bottom report the F-score obtained by the methods for the same outlier fractions p and uncertainty levels f ranging in [1.0, 3.0].

Fig. 6 Comparison with DensitySamp and Deterministic

6.4 Case studies

Handwritten digits.

MNIST is a high-dimensional dataset of handwritten digits represented as 28 × 28 pixel images, extensively employed in the literature. We simulated an uncertain scenario in which digits are blurred, by associating with each non-overlapping 2 × 2 tile of image pixels a normally distributed uncertain component oi with mean μi and standard deviation σi. The parameters μi and σi are obtained as the mean and the standard deviation of the intensities of the pixels within the corresponding tile. Thus, the dataset consists of 196-dimensional uncertain objects. We randomly selected 590 digits from the class “1” and 10 digits from the remaining classes to form a dataset of 600 uncertain objects.
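
A minimal sketch of the tile-based construction is given below; the helper is hypothetical and only illustrates the preprocessing just described:

```python
import numpy as np

def blur_to_uncertain(image):
    """Turn a 28x28 digit into a 196-dimensional uncertain object.

    Each non-overlapping 2x2 tile of pixels contributes one normally
    distributed component, with mean and standard deviation equal to the
    mean and standard deviation of the tile intensities.
    """
    tiles = image.reshape(14, 2, 14, 2).swapaxes(1, 2).reshape(14, 14, 4)
    mu = tiles.mean(axis=2).ravel()       # 196 means, one per tile
    sigma = tiles.std(axis=2).ravel()     # 196 standard deviations, one per tile
    return mu, sigma
```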

Figure 7 shows the dataset objects (pixel intensities are those corresponding to the μi values). Digits corresponding to the outliers have been highlighted by complementing their intensity values (so that they appear on a dark background). Outliers are computed for k = 5 and for the radius value determined by the algorithm with α = 0.02 (corresponding to about 12 objects) and β = 0.5. It can be seen that eight out of the ten non-“1” digits have been detected. The only exceptions are an “8” digit with markedly uncertain borders and a largely distorted “9” digit. As for the remaining outliers, they correspond to “1” digits that are unusual within the collection.

Fig. 7 Uncertain MNIST and detected outliers

The number of candidates returned by the candidate selection phase was 110. This number witnesses the difficulty of the problem, since the supports of the objects are largely overlapping. Despite this number, during candidate filtering the mean number of neighbors considered until early stop was only 7.4. By using m = 100, the execution time of UDBOD was about 104.4 seconds (11.7 seconds for parameter estimation, 5 seconds for candidate selection, and 87.7 seconds for candidate filtering).

Mobile ad-hoc network data.

A Mobile Ad hoc NETwork (MANET) (Bai and Helmy 2006) is a collection of wireless mobile nodes forming a self-configuring network. Applications include mobile classrooms, battlefield communication, disaster relief, and others. The mobility model of a MANET describes the movement pattern of mobile users, and how their location, velocity and acceleration change over time. A popular mobility model is the Random Waypoint model (Bettstetter et al. 2004), in which nodes move independently within a certain area, called support area. For a square support area of side a centered in (x0,y0), the pdf of the random waypoint model is given by the following analytical expression:

$$ f_{\text{rw}}(x,y) \approx \frac{36}{a^{6}} \cdot \left( (x - x_{0})^{2} - \frac{a^{2}}{4} \right) \cdot \left( (y - y_{0})^{2} - \frac{a^{2}}{4} \right), $$

for \(x\in \left [x_0-\frac {a}{2},x_0+\frac {a}{2}\right ]\) and \(y\in \left [y_0-\frac {a}{2},y_0+\frac {a}{2}\right ]\), and frw(x,y) = 0 outside.
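
The pdf can be evaluated directly, and node positions can be drawn from it, for instance by rejection sampling; a minimal illustrative sketch (the helper names are ours):

```python
import numpy as np

def f_rw(x, y, x0, y0, a):
    """Random waypoint pdf over a square support of side a centered in (x0, y0)."""
    inside = (np.abs(x - x0) <= a / 2) & (np.abs(y - y0) <= a / 2)
    val = (36.0 / a**6) * ((x - x0)**2 - a**2 / 4) * ((y - y0)**2 - a**2 / 4)
    return np.where(inside, val, 0.0)

def sample_rw(n, x0, y0, a, rng=None):
    """Draw n node positions from f_rw by rejection sampling (uniform proposal)."""
    rng = np.random.default_rng() if rng is None else rng
    peak = f_rw(x0, y0, x0, y0, a)            # maximum density, attained at the center
    out = []
    while len(out) < n:
        x = rng.uniform(x0 - a / 2, x0 + a / 2, size=n)
        y = rng.uniform(y0 - a / 2, y0 + a / 2, size=n)
        keep = rng.uniform(0.0, peak, size=n) <= f_rw(x, y, x0, y0, a)
        out.extend(zip(x[keep], y[keep]))
    return np.array(out[:n])
```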

The nodes of a MANET are typically distinguished by their limited power, processing, and memory resources. Multiple hops are usually needed for a node to exchange information with any other node, and nodes take advantage of their neighbors in order to communicate with the rest of the network. A node can correctly receive packets if the signal strength of the packet at that node is above a certain threshold, and the required transmission power grows with the square of the distance separating the transmitter from the receiver.

The dataset (see Fig. 8a) consists of 250 MANET nodes distributed along three different paths joining two locations. Each red square in the figure delimits the support area associated with a node (diameters of the support areas range from 2% to 6% of the simulation area side). Since information exchange is accomplished by multiple hops involving neighbor nodes, the smaller the number of neighbors lying in the neighborhood of a node, the less reliable, in terms of QoS (Quality of Service), the region to which the node belongs. Thus, we exploit uncertain distance-based outlier detection to determine the less reliable regions of the simulation area. With this aim, we fixed the radius R to 7% of the simulation area side (a circular region of radius R is highlighted in Fig. 8a), a value corresponding to a predefined level of transmission power due to device constraints.

Fig. 8 MANET dataset: (a) nodes distributed along 3 paths; (b) QoS associated with locations (colors range from red, for higher QoS values, to blue); uncertain outliers (blue asterisks) for (c) k = 3 and (d) k = 10

Since the QoS can be related to the number of neighbors, we detected the uncertain distance-based outliers for increasing values of k. Figures 8c and d show the outliers for k = 3 and k = 10, respectively. The outliers for k = 3 are positioned along the central path, which corresponds to the least populated region of the area, while the additional outliers for k = 10 are located along the path on the right, which corresponds to a mildly populated region of the area. As for the remaining objects, they are located along the path on the left, which corresponds to the most reliable route between the two extrema. Figure 8b provides a picture of the QoS associated with each location of the area, since the color of each point (colors range from blue, for k = 1, to red, for k = 35) is proportional to the smallest value of k for which the location, regarded as an uncertain object, becomes an outlier.

7 Conclusions

A novel definition of uncertain outlier has been introduced, dealing with arbitrarily shaped multidimensional pdfs and representing the generalization of the classic distance-based outlier definition. Our approach corresponds to performing a nearest neighbor density estimate over all the possible outcomes of the dataset and, to the best of our knowledge, has no counterpart in the literature. Possible future research directions include techniques for alleviating the cost involved in the computation of integrals, possibly based on data indexing techniques, and alternative notions of uncertain outlier, such as ones inspired by adaptive density estimation strategies, or considering more involved scenarios including time-varying distributions.