A Gradient-Based Method for Robust Sensor Selection in Hypothesis Testing

Ma, Ting; Qian, Bo; Niu, Dunbiao; Song, Enbin; Shi, Qingjiang

doi:10.3390/s20030697


¹ College of Mathematics, Sichuan University, Chengdu 610064, China

² School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China

³ School of Software Engineering, Tongji University, Shanghai 201804, China

* Author to whom correspondence should be addressed.

Sensors 2020, 20(3), 697; https://doi.org/10.3390/s20030697

Submission received: 20 December 2019 / Revised: 23 January 2020 / Accepted: 23 January 2020 / Published: 27 January 2020

(This article belongs to the Section Sensor Networks)


Abstract: This paper considers robust binary hypothesis testing with Gaussian distributions under a Bayesian optimality criterion in a wireless sensor network (WSN). The covariance matrix under each hypothesis is known, while the mean vector under each hypothesis drifts in an ellipsoidal uncertainty set. Because bandwidth and energy are limited, we seek a subset of p out of m sensors that achieves the best detection performance. In this setup, a minimax robust sensor selection problem is formulated to handle the uncertainty in the distribution means. Following a popular approach, minimizing the maximum overall error probability with respect to the selection matrix is approximated by maximizing the minimum Chernoff distance between the distributions of the selected measurements under the null and alternative hypotheses. We then use Danskin’s theorem to compute the gradient of the objective function of the converted maximization problem, and apply the orthogonal constraint-preserving gradient algorithm (OCPGA) to solve the relaxed maximization problem without 0/1 constraints. It is shown that the OCPGA obtains a stationary point of the relaxed problem. We also provide the computational complexity of the OCPGA, which is much lower than that of the existing greedy algorithm. Finally, numerical simulations illustrate that, after the same projection and refinement phases, the OCPGA-based method obtains better solutions than the greedy algorithm-based method with up to 48.72% shorter runtimes. In particular, for small-scale problems, the OCPGA-based method is able to attain the globally optimal solution.

Keywords:

wireless sensor network; robust sensor selection; hypothesis testing; Chernoff distance; Danskin’s theorem; orthogonal constraint-preserving gradient algorithm

1. Introduction

Wireless sensor networks (WSNs) are extensively used to collect and transmit data in many applications, such as autonomous driving [1], disaster detection [2], target tracking [3], etc. In the WSN, it is usually unaffordable to collect and process all sensor data due to the limitations of power and communication resources [4,5]. Therefore, it is of great significance to choose an optimal subset of sensors such that the best performance is attained only based on data collected by the selected sensors, which is the so-called sensor selection problem.

In the past dozen years, sensor selection has been widely studied in various fields, e.g., estimation [6], target tracking [7], and condition monitoring [8]. For parameter estimation in a dynamic system with Kalman filtering, [6] chose the optimal subset of sensors in each iteration by minimizing the error covariance matrix of the next iteration. The sensor selection problem for target tracking in large sensor networks was addressed in [7] based on generalized information gain. Work [8] provided an entropy-based sensor selection method for condition monitoring and prognostics of aircraft engines, which can describe the information contained in the sensor data sets.

Meanwhile, the sensor selection problem in hypothesis testing has also attracted a lot of attention [9,10,11]. In this setting, only some of the sensors in the WSN are activated to transmit observation data, and decisions are then made based on the measurements of the selected sensors to achieve the best detection performance. Once the optimal sensor selection matrix is fixed, the corresponding hypothesis testing problem reduces to a standard one, which is easy to handle. Hence, it is crucial to solve the involved sensor selection problem.

Work [9] studied sensor selection for binary Gaussian hypothesis testing in the Neyman–Pearson framework, where the true distribution under each hypothesis is exactly known. It approximately converted the minimization of the false alarm error probability into the maximization of the Kullback-Leibler (KL) divergence between the distributions of the selected measurements under the null and alternative hypotheses. Additionally, [9] was the first to propose a sensor selection framework of relaxation followed by projection, and provided a greedy algorithm that solves the relaxed problem by optimizing each column vector of the selection matrix.

In practical applications, the events to be detected (i.e., the parameters of the hypothesis testing) are usually estimated from training data and are affected by uncertainty factors such as poor observation environments and system errors. These parameters are therefore not known precisely, but are assumed to lie in given uncertainty sets [10,11]. In these scenarios, the minimax robust sensor selection problem is formulated to cope with the parameter uncertainty. For binary Gaussian hypothesis testing under the Neyman–Pearson framework, following the framework in [9], work [10] investigated the sensor selection problem in which the distribution mean under each hypothesis falls in an ellipsoidal uncertainty set (the distribution covariance is known). Furthermore, [11] considered the sensor selection problems arising in robust Gaussian hypothesis testing with both Neyman–Pearson and Bayesian optimality criteria, where the distribution mean under each hypothesis drifts in an ellipsoidal uncertainty set. For the Bayesian framework, minimizing the maximum overall error probability is approximately converted to maximizing the minimum Chernoff distance. A corresponding greedy algorithm, together with projection and refinement, is then proposed to solve the robust sensor selection problem in the above hypothesis testing.

It has been shown in [11] that the robust sensor selection problem in hypothesis testing is NP-hard under the Bayesian framework. When the size of the sensor selection problem is small, its optimal solution can be obtained by exhaustively traversing all possible choices. For a large-scale problem, however, the exhaustive method is not affordable due to its high computational complexity. Although the aforementioned greedy algorithm-based method (i.e., greedy algorithm, projection and refinement) has a lower computational complexity than the exhaustive method, it cannot reach the globally optimal solution in many cases, and its computational complexity is still high for large-scale problems. Therefore, it is important to seek a more efficient algorithm for solving the robust sensor selection problem in hypothesis testing for WSNs. To our surprise, even though other general sensor selection problems have been continuously investigated, for instance, sparse sensing [12] and sensor selection in sequential hypothesis testing [13], there has been little progress on this type of sensor selection problem since [11] was published in 2011, which motivates our research.

In this paper, we consider the same robust binary Gaussian hypothesis testing under a Bayesian framework as in [11], where the distribution mean under each hypothesis lies in an ellipsoidal uncertainty set. We attempt to select an optimal subset of sensors such that the maximum overall error probability is minimized. Following a similar idea to [11], minimizing the maximum overall error probability is approximated by maximizing the minimum Chernoff distance between the two distributions under the null and alternative hypotheses. Our main contributions can be summarized as follows.

First, we convert the max-min problem for the Chernoff distance into a maximization problem, and adopt the orthogonal constraint-preserving gradient algorithm (OCPGA) [14] to obtain a stationary point of the relaxed maximization problem without 0/1 constraints.

Second, when applying the OCPGA to the relaxed maximization problem, we use Danskin’s theorem [15] to obtain its gradient. Furthermore, an efficient bisection is applied to find the means of the distributions under the null and alternative hypotheses that correspond to the minimum Chernoff distance.

Third, the computational complexity of the OCPGA is shown theoretically to be lower than that of the greedy algorithm in [11], while numerical simulations show that the OCPGA-based method (i.e., OCPGA, projection and refinement) can obtain better solutions than the greedy algorithm-based method (i.e., the R-C algorithm in [11]) with up to 48.72% shorter runtimes. Therefore, better solutions are available from our proposed OCPGA-based method in some scenarios.

The remainder of this paper is organized as follows. Section 2 states the problem formulation. The proposed OCPGA, as well as projection and refinement phases, is characterized in Section 3. In Section 4, the existence and computation of gradient are provided. Section 5 presents some numerical experiments to corroborate our theoretical results, while Section 6 concludes the paper.

Notations: Denote $\mathbb{R}^{m}$ and $\mathbb{R}^{m \times p}$ as the $m$-dimensional real vector space and the $m \times p$ dimensional real matrix space, respectively. Let $I$ and $0$ be the identity matrix and the zero matrix, whose dimensions will be clear from the context; bold-face lower-case letters are used for vectors, while bold-face upper-case letters are for matrices. $\mathcal{N}(\mu, S)$ represents the Gaussian distribution with mean $\mu$ and covariance $S$. For a matrix $A$, $\operatorname{tr}(A)$, $|A|$, $\|A\|_{F}$, $A^{H}$ and $A_{i,j}$ denote its trace, determinant, Frobenius norm, conjugate transpose and $(i, j)$-th entry, respectively. For square matrices $A$ and $B$, $A \succeq (\succ) B$ means that $A - B$ is positive semidefinite (positive definite). For a positive semidefinite matrix $A$, $A^{\frac{1}{2}}$ stands for its square root.

2. Problem Formulation

2.1. System Model

Define $x = (x_{1}, x_{2}, \dots, x_{m})^{T} \in \mathbb{R}^{m}$ as the observation vector of all $m$ sensors (each sensor corresponds to a one-dimensional measurement). Consider the same binary Gaussian distribution robust hypothesis testing as in [11]:

$$H_{0} : x \sim \mathcal{N}(m_{0}, S_{0}), \qquad H_{1} : x \sim \mathcal{N}(m_{1}, S_{1}), \tag{1}$$

where the mean vector $m_{i}$ falls in a given ellipsoidal uncertainty set $\mathcal{E}(\bar{m}_{i}, k_{i} S_{i}^{-1}) = \{x \in \mathbb{R}^{m} \mid (x - \bar{m}_{i})^{T} S_{i}^{-1} (x - \bar{m}_{i}) \leqslant \frac{1}{k_{i}}\}$, $\bar{m}_{i}$ denotes the mean estimated from training data, the covariance $S_{i}$ is a known matrix, and $k_{i} \in (0, +\infty]$ is the robustness parameter, $i = 0, 1$. Obviously, when $k_{i} = +\infty$, the ellipsoidal uncertainty set reduces to a single point, and thereby $m_{i} = \bar{m}_{i}$, i.e., there is no uncertainty.

Sensor Selection: In the WSN, sensors transmit their observations to a fusion node, which then performs the hypothesis testing based on its received measurements. Due to power constraints, suppose that only $p$ out of $m$ sensors ($p < m$) are chosen to transmit their observations to the fusion node. We aim at selecting $p$ sensors to guarantee the best detection performance, that is, seeking a selection matrix $E \in \mathbb{R}^{m \times p}$ with 0/1 elements such that the best detection performance based on the measurements $y = E^{T} x \in \mathbb{R}^{p}$ is achieved. It is easy to see that $E$ has exactly one unit entry per column (corresponding to a selected sensor) and at most one unit entry per row (each sensor is selected at most once). Therefore, $E$ should be a column-orthogonal matrix, i.e., $E^{T} E = I$.
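The structure of the selection matrix can be illustrated with a small Python sketch. The helper name `selection_matrix`, the 0-based sensor indexes, and the sample data are assumptions made for illustration only; the sketch builds a 0/1 matrix with one unit entry per column and forms the reduced measurement vector $y = E^{T} x$.

```python
import numpy as np

def selection_matrix(indices, m):
    """Return the m x p 0/1 matrix whose j-th column is the canonical vector e_{indices[j]}."""
    indices = list(indices)
    if len(set(indices)) != len(indices):
        raise ValueError("each sensor can be selected at most once")
    E = np.zeros((m, len(indices)))
    E[indices, np.arange(len(indices))] = 1.0   # one unit entry per column
    return E

# Hypothetical example: pick sensors 3 and 0 out of m = 5 (0-based indexes).
m, selected = 5, [3, 0]
E = selection_matrix(selected, m)
x = np.array([10.0, 11.0, 12.0, 13.0, 14.0])   # observations of all sensors
y = E.T @ x                                    # measurements of the selected sensors only
```

Because every column is a distinct canonical vector, the columns are orthonormal, which is the constraint $E^{T} E = I$ stated above.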

Hypothesis Testing Induced by $E$: Owing to $y = E^{T} x$, the original hypothesis testing (1) about $x$ in the high-dimensional space $\mathbb{R}^{m}$ is converted to one about $y$ in the lower-dimensional space $\mathbb{R}^{p}$:

$$H_{0} : y \sim \mathcal{N}(E^{T} m_{0}, E^{T} S_{0} E), \qquad H_{1} : y \sim \mathcal{N}(E^{T} m_{1}, E^{T} S_{1} E). \tag{2}$$

Without loss of generality, we assume that

$$\mathcal{N}(E^{T} m_{0}, E^{T} S_{0} E) \neq \mathcal{N}(E^{T} m_{1}, E^{T} S_{1} E). \tag{3}$$

Otherwise, the hypothesis testing (2) makes no sense. For hypothesis testing (2), when $E$ and $m_{i}$ are determined, the fusion node executes the following likelihood ratio (LR) test:

$$l(y) := \frac{f_{1}(y; E)}{f_{0}(y; E)} \underset{H_{0}}{\overset{H_{1}}{\gtrless}} \gamma,$$

where $f_{i}(y; E)$ is the density of $\mathcal{N}(E^{T} m_{i}, E^{T} S_{i} E)$, $i = 0, 1$, and $\gamma$ is the test threshold [16].

Sensor Selection Optimality Criteria: Under a Bayesian framework, detection performance is quantified by the overall error probability $P_{e}$, where $P_{e} := P(H_{0}) P_{F} + P(H_{1}) P_{M}$, with $P_{F} := P(l(y) > \gamma \mid H_{0})$ and $P_{M} := P(l(y) < \gamma \mid H_{1})$ being the false alarm and miss detection probabilities, respectively. Then, the robust sensor selection problem in hypothesis testing under a Bayesian framework is to seek the selection matrix $E$ such that the maximum overall error probability $P_{e}$ is minimized when making decisions with respect to hypothesis testing (2), that is, to solve the following optimization problem:

$$\begin{aligned} \min_{E \in \mathbb{R}^{m \times p}} \; \max_{(m_{0}, m_{1}) \in \mathcal{E}(\bar{m}_{0}, k_{0} S_{0}^{-1}) \times \mathcal{E}(\bar{m}_{1}, k_{1} S_{1}^{-1})} \quad & P_{e} \\ \text{s.t.} \quad & E^{T} E = I, \; E_{i,j} \in \{0, 1\}. \end{aligned} \tag{4}$$

2.2. Problem Transformation

Since computing $P_{e}$ in problem (4) is usually difficult due to the involved integrals, we follow the popular approach in [11] to approximately optimize $P_{e}$. Based on the Chernoff lemma [17], as the number of independent and identically distributed (i.i.d.) measurements increases, the rate of exponential decay of $P_{e}$ is equal to the Chernoff distance between the two distributions $\mathcal{N}(E^{T} m_{0}, E^{T} S_{0} E)$ and $\mathcal{N}(E^{T} m_{1}, E^{T} S_{1} E)$. On the other hand, according to the definition of the Chernoff distance between two probability densities $f_{0}$ and $f_{1}$:

$$D_{C}(f_{1}, f_{0}) := \max_{s \in [0, 1]} \, -\log \int f_{1}^{s}(x) f_{0}^{1 - s}(x) \, dx,$$

we derive that the Chernoff distance between the distributions $\mathcal{N}(E^{T} m_{0}, E^{T} S_{0} E)$ and $\mathcal{N}(E^{T} m_{1}, E^{T} S_{1} E)$ is

$$f_{C}(E, m_{0}, m_{1}) := \max_{s \in [0, 1]} f(E, s, m_{0}, m_{1}) \tag{5}$$

with

$$\begin{aligned} f(E, s, m_{0}, m_{1}) := {} & \frac{1}{2} s (1 - s) (m_{1} - m_{0})^{T} E \left(E^{T} (s S_{0} + (1 - s) S_{1}) E\right)^{-1} E^{T} (m_{1} - m_{0}) \\ & - \frac{1}{2} \log \frac{|E^{T} S_{0} E|^{s} \, |E^{T} S_{1} E|^{1 - s}}{|E^{T} (s S_{0} + (1 - s) S_{1}) E|}. \end{aligned} \tag{6}$$

Therefore, as in [11], minimizing the maximum overall error probability $P_{e}$ can be approximately converted to maximizing the minimum Chernoff distance $f_{C}(E, m_{0}, m_{1})$. Accordingly, problem (4) is transformed into

$$\begin{aligned} \max_{E} \; \min_{(m_{0}, m_{1}) \in \mathcal{E}(\bar{m}_{0}, k_{0} S_{0}^{-1}) \times \mathcal{E}(\bar{m}_{1}, k_{1} S_{1}^{-1})} \quad & f_{C}(E, m_{0}, m_{1}) \\ \text{s.t.} \quad & E^{T} E = I, \; E_{i,j} \in \{0, 1\}, \end{aligned} \tag{7}$$

or equivalently,

$$\begin{aligned} \max_{E} \; \min_{(m_{0}, m_{1}) \in \mathcal{E}(\bar{m}_{0}, k_{0} S_{0}^{-1}) \times \mathcal{E}(\bar{m}_{1}, k_{1} S_{1}^{-1})} \; \max_{s \in [0, 1]} \quad & f(E, s, m_{0}, m_{1}) \\ \text{s.t.} \quad & E^{T} E = I, \; E_{i,j} \in \{0, 1\}. \end{aligned} \tag{8}$$
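As a rough illustration of Equations (5) and (6), the following Python sketch evaluates $f$ for fixed means and maximizes it over $s$ with a golden-section search, relying on the concavity of $f$ in $s$ shown in Proposition 1. The function names, the use of NumPy, and the choice of golden-section search are assumptions made for illustration; the uncertainty sets for the means are not handled here.

```python
import numpy as np

def chernoff_f(E, s, m0, m1, S0, S1):
    """Evaluate f(E, s, m0, m1) as written in Equation (6)."""
    d = E.T @ (m1 - m0)
    A0, A1 = E.T @ S0 @ E, E.T @ S1 @ E
    A = s * A0 + (1.0 - s) * A1
    logdet = lambda M: np.linalg.slogdet(M)[1]
    quad = 0.5 * s * (1.0 - s) * (d @ np.linalg.solve(A, d))
    return quad - 0.5 * (s * logdet(A0) + (1.0 - s) * logdet(A1) - logdet(A))

def chernoff_distance(E, m0, m1, S0, S1, tol=1e-10):
    """Approximate f_C of Equation (5) by golden-section search over s in [0, 1]."""
    g = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = 0.0, 1.0
    while b - a > tol:
        c1, c2 = b - g * (b - a), a + g * (b - a)
        if chernoff_f(E, c1, m0, m1, S0, S1) < chernoff_f(E, c2, m0, m1, S0, S1):
            a = c1          # the maximizer lies in [c1, b]
        else:
            b = c2          # the maximizer lies in [a, c2]
    s = 0.5 * (a + b)
    return chernoff_f(E, s, m0, m1, S0, S1), s
```

For equal identity covariances the log-determinant term vanishes and $f$ reduces to $\frac{1}{2} s (1 - s) \| E^{T}(m_{1} - m_{0}) \|^{2}$, whose maximum is attained at $s = \frac{1}{2}$; this gives a convenient sanity check.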

Work [11] proved that problem (7) is NP-hard, and proposed a suboptimal greedy method along with projection and refinement phases to deal with it. Although the greedy algorithm-based method (i.e., the R-C algorithm in [11]) has a lower computational complexity than the exhaustive method, it cannot reach the globally optimal solution in many cases, and its computational complexity remains high for large-scale problems. Therefore, we endeavor to propose a more efficient method to obtain a better solution of problem (7).

It is not difficult to see that solving problem (7) can be divided into an inner minimization and an outer maximization. For a given selection matrix $E$, define

$$\tilde{f}_{C}(E) := \min_{(m_{0}, m_{1}) \in \mathcal{E}(\bar{m}_{0}, k_{0} S_{0}^{-1}) \times \mathcal{E}(\bar{m}_{1}, k_{1} S_{1}^{-1})} f_{C}(E, m_{0}, m_{1}), \tag{9}$$

and denote by $(\hat{m}_{0}(E), \hat{m}_{1}(E))$ the optimal solution of the following subproblem:

$$\min_{(m_{0}, m_{1}) \in \mathcal{E}(\bar{m}_{0}, k_{0} S_{0}^{-1}) \times \mathcal{E}(\bar{m}_{1}, k_{1} S_{1}^{-1})} f_{C}(E, m_{0}, m_{1}). \tag{10}$$

Then it holds that $\tilde{f}_{C}(E) = f_{C}(E, \hat{m}_{0}(E), \hat{m}_{1}(E))$. Correspondingly, problem (7) can be equivalently transformed into

$$\begin{aligned} \text{Primal Problem (PP)}: \quad \max_{E \in \mathbb{R}^{m \times p}} \quad & \tilde{f}_{C}(E) \\ \text{s.t.} \quad & E^{T} E = I, \; E_{i,j} \in \{0, 1\}. \end{aligned} \tag{PP}$$

Remarkably, the optimal solutions of problems (7) and (PP) are the same. Hence, we discuss how to solve problem (PP) in the following. Taking the orthogonal constraint $E^{T} E = I$ into account, we adopt the OCPGA-based method to deal with problem (PP).

3. OCPGA-Based Method

As in the greedy algorithm-based method proposed in [11], solving problem (PP) can be divided into three successive phases: relaxation, projection and refinement. In the relaxation phase, the OCPGA is used to handle the relaxed problem without 0/1 constraints, while the projection and refinement phases are the same as in the greedy algorithm-based method. The workflow of the OCPGA-based method for solving problem (PP) is given in Figure 1, and the details are presented in Section 3.1 and Section 3.2.

3.1. Relaxation Phase

By relaxing the 0/1 constraints, problem (PP) is reduced to

$$\begin{aligned} \text{Relaxed Problem (RP)}: \quad \max_{E \in \mathbb{R}^{m \times p}} \quad & \tilde{f}_{C}(E) \\ \text{s.t.} \quad & E^{T} E = I. \end{aligned} \tag{RP}$$

Taking into consideration the orthogonal constraint in problem (RP), once the gradient $\nabla \tilde{f}_{C}(E)$ of the objective function $\tilde{f}_{C}(E)$ exists and is computable, we can implement the OCPGA in [14] to solve problem (RP), as presented in Algorithm 1.

Algorithm 1: OCPGA

The update formula of $Y_{n}(\tau)$ in Algorithm 1 is the Cayley transformation [18], and thereby $Y_{n}(\tau)$ always satisfies the orthogonal constraint $Y_{n}(\tau)^{T} Y_{n}(\tau) = I$ for each iteration $n$. Meanwhile, the stepsize $\tau$ in Algorithm 1 is chosen by curvilinear search algorithms [19] combined with the Barzilai-Borwein (BB) [20] nonmonotonic line search [21]. It has been shown by Lemma 2.2 and Remark 2.3 in [22] that the sequence generated by the OCPGA is globally convergent to a stationary point. For clarity and completeness, we state the convergence result of Algorithm 1 in the following theorem without proof.

Theorem 1 ([22]). When $\tilde{f}_{C}(E)$ in problem (RP) is differentiable and the gradient $\nabla \tilde{f}_{C}(E)$ is derived, the point $E^{*}$ obtained by the OCPGA in Algorithm 1 is a stationary point of problem (RP).

Moreover, the computational complexity of the OCPGA in Algorithm 1 is $O(m p^{2})$ [14]. Since $p$ is generally much smaller than $m$, this is much lower than the $O(m^{3} p)$ complexity of the greedy algorithm in [11]. Hence, our proposed OCPGA is more efficient than the greedy algorithm, particularly when $m$ is large.
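The sketch below illustrates only the orthogonality-preserving Cayley update mentioned above, not the full Algorithm 1 or its stepsize rule. The specific skew-symmetric matrix $W = G X^{T} - X G^{T}$ and the ascent sign convention follow the commonly used Wen-Yin style update and are assumptions rather than a transcription of Algorithm 1; the function name `cayley_step` is hypothetical. For simplicity the code solves a dense $m \times m$ system, so it does not achieve the $O(m p^{2})$ cost quoted above.

```python
import numpy as np

def cayley_step(X, G, tau):
    """One Cayley-transform ascent step from a column-orthogonal X along gradient G.

    Assumed update (sign chosen for maximization):
        W = G X^T - X G^T            (skew-symmetric)
        Y(tau) = (I - tau/2 W)^{-1} (I + tau/2 W) X
    """
    m = X.shape[0]
    W = G @ X.T - X @ G.T
    I = np.eye(m)
    # (I - tau/2 W) is always invertible because W is skew-symmetric.
    return np.linalg.solve(I - 0.5 * tau * W, (I + 0.5 * tau * W) @ X)
```

Since $(I - \frac{\tau}{2} W)^{-1} (I + \frac{\tau}{2} W)$ is an orthogonal matrix for any skew-symmetric $W$, the new iterate keeps $Y^{T} Y = I$ whenever $X^{T} X = I$, for every stepsize $\tau$.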

3.2. Projection and Refinement Phases

Generally, the solution $E^{*}$ obtained by the OCPGA in Algorithm 1 is not a selection matrix, because the elements of $E^{*}$ are not guaranteed to be 0/1. Thus we need to further execute the projection and refinement phases as in [11].

First, we seek the selection matrix $\tilde{E}$ whose range space is closest to that of $E^{*}$ by solving the following problem:

$$\begin{aligned} \min_{E} \quad & \| E E^{T} - E^{*} (E^{*})^{T} \|_{F} \\ \text{s.t.} \quad & E_{i,j} \in \{0, 1\}, \; E^{T} E = I. \end{aligned} \tag{11}$$

It has been shown in [11] that problem (11) admits a closed-form solution. Specifically, let $(j_{1}, \dots, j_{p})$ be the indexes of the $p$ largest entries on the diagonal of $E^{*} (E^{*})^{T}$; then $\tilde{E} = (i_{j_{1}}, \dots, i_{j_{p}})$ is the optimal solution of problem (11), where $i_{j}$ stands for the $j$-th column of the identity matrix $I_{m}$.
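The closed-form projection described above can be sketched in a few lines of Python. The function name `project_to_selection` and the 0-based indexes are assumptions for illustration; the diagonal of $E^{*} (E^{*})^{T}$ is computed as the row-wise sum of squares so that the $m \times m$ product is never formed.

```python
import numpy as np

def project_to_selection(E_star):
    """Closed-form solution of problem (11): keep the p largest diagonal entries of E* (E*)^T."""
    m, p = E_star.shape
    diag = np.sum(E_star ** 2, axis=1)     # diagonal of E* (E*)^T
    idx = np.argsort(diag)[::-1][:p]       # indexes j_1, ..., j_p (largest first)
    E_tilde = np.zeros((m, p))
    E_tilde[idx, np.arange(p)] = 1.0       # columns are canonical vectors i_{j_k}
    return E_tilde, idx
```

If several diagonal entries are tied, the order returned by `np.argsort` decides which of them is kept; the sketch makes no attempt to break ties in a particular way.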

Subsequently, after projecting onto the set of 0/1 selection matrices, we further implement a refinement around $\tilde{E}$. Setting $E = \tilde{E}$, the first column of $E$ is viewed as the optimization variable, while all other columns are fixed. We then sweep through all canonical vectors (i.e., columns of the identity matrix $I_{m}$) different from the remaining $p - 1$ columns of $E$, and choose the one corresponding to the maximum $\tilde{f}_{C}$ as the first column. In the next step, the procedure is repeated for the second column, and so on, up to the $p$-th step. This completes one refinement, and the matrix $\hat{E} = E$ is regarded as a solution of problem (PP). Each additional refinement can only maintain or improve the objective value. When $p = 1$, the solution achieved by one refinement is indeed the globally optimal selection matrix of problem (PP). If time permits, we can execute the refinement phase several times until the objective function value $\tilde{f}_{C}$ no longer changes.
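The column-by-column refinement can be sketched as follows. The sketch works on sensor index lists rather than on matrices, and the `objective` callable is a hypothetical stand-in for the robust objective of problem (PP); the function name `refine` and the `max_sweeps` cap are likewise assumptions.

```python
def refine(selected, m, objective, max_sweeps=10):
    """Column-by-column refinement of a list of p distinct sensor indexes.

    `objective` maps an index list to a score to be maximized (a stand-in for
    the robust objective of problem (PP)). Sweeps repeat until nothing improves.
    """
    selected = list(selected)
    best = objective(selected)
    for _ in range(max_sweeps):
        improved = False
        for k in range(len(selected)):                    # k-th column is the variable
            others = set(selected[:k] + selected[k + 1:])  # the fixed p - 1 columns
            for j in range(m):
                if j in others or j == selected[k]:
                    continue
                trial = selected[:k] + [j] + selected[k + 1:]
                val = objective(trial)
                if val > best:
                    best, selected, improved = val, trial, True
        if not improved:
            break
    return selected, best
```

Each sweep evaluates the objective $m - p$ times per column, which agrees with the $(m - p) p$ evaluations per refinement counted in the text.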

Notice that the computational cost of projection is very small (as there exists an analytical solution), while in each refinement we need to evaluate the objective function $\tilde{f}_{C}$ a total of $(m - p) p$ times. Since the OCPGA is more efficient than the greedy algorithm, and the remaining projection and refinement phases are the same, the OCPGA-based method is more efficient than the greedy algorithm-based method.

Remark 1.

For the OCPGA in Algorithm 1, its initial point is randomly chosen from column orthogonal matrices. Experiments illustrate that different initial points may result in different outcomes. Therefore, in order to improve the performance, we implement the OCPGA with different initial points several times, and after projection and refinement, choose the best solution as the output.

Remark 2.

According to the above discussion, the key to implementing the OCPGA-based method is computing the gradient $\nabla \tilde{f}_{C}(E)$ in problem (RP). In the next section, we show how to obtain the gradient $\nabla \tilde{f}_{C}(E)$ and discuss when it exists.

4. Existence and Computation of the Gradient in Problem (RP)

As can be easily seen from Algorithm 1, it is essential to compute the gradient $\nabla \tilde{f}_{C}(E)$ in problem (RP). Invoking the definition of $\tilde{f}_{C}(E)$ in Equation (9), we have

$$\tilde{f}_{C}(E) = \min_{(m_{0}, m_{1}) \in \mathcal{E}(\bar{m}_{0}, k_{0} S_{0}^{-1}) \times \mathcal{E}(\bar{m}_{1}, k_{1} S_{1}^{-1})} f_{C}(E, m_{0}, m_{1}),$$

with $f_{C}(E, m_{0}, m_{1}) = \max_{s \in [0, 1]} f(E, s, m_{0}, m_{1})$. To proceed, we first prove the strict concavity of $f(E, s, m_{0}, m_{1})$ with respect to $s$, which forms a key ingredient of our later arguments.

Proposition 1.

Given $E$ and $(m_{0}, m_{1})$, $f(E, s, m_{0}, m_{1})$ defined by Equation (6) is a strictly concave function of $s$.

Proof.

It is easily seen that $f(E, s, m_{0}, m_{1})$ is continuously differentiable with respect to $s$ for given $E$ and $(m_{0}, m_{1})$. Hence we turn to show that the second derivative $\nabla_{s}^{2} f(E, s, m_{0}, m_{1})$ is negative.

For notational brevity, define $\Lambda := (m_{1} - m_{0}) (m_{1} - m_{0})^{T}$ and $B(s) := E \left(E^{T} (s S_{0} + (1 - s) S_{1}) E\right)^{-1} E^{T}$. It is clear that both $\Lambda$ and $B(s)$ are positive semidefinite. Taking the derivative of $f(E, s, m_{0}, m_{1})$ with respect to $s$, we have

$$\begin{aligned} \nabla_{s} f(E, s, m_{0}, m_{1}) = {} & \left(\tfrac{1}{2} - s\right) \operatorname{tr}[B(s) \Lambda] + \tfrac{1}{2} \operatorname{tr}[B(s) (S_{0} - S_{1})] \\ & - \tfrac{1}{2} s (1 - s) \operatorname{tr}[B(s) (S_{0} - S_{1}) B(s) \Lambda] - \tfrac{1}{2} \log \frac{|E^{T} S_{0} E|}{|E^{T} S_{1} E|}. \end{aligned} \tag{12}$$

Then, taking the derivative of $\nabla_{s} f(E, s, m_{0}, m_{1})$ with respect to $s$ once again, we obtain

$$\begin{aligned} \nabla_{s}^{2} f(E, s, m_{0}, m_{1}) = {} & - \operatorname{tr}[B(s) \Lambda] + (2 s - 1) \operatorname{tr}[B(s) (S_{0} - S_{1}) B(s) \Lambda] \\ & + s (1 - s) \operatorname{tr}\{[B(s) (S_{0} - S_{1})]^{2} B(s) \Lambda\} - \tfrac{1}{2} \operatorname{tr}\{[B(s) (S_{0} - S_{1})]^{2}\} \\ \overset{(a)}{=} {} & - \operatorname{tr}[B(s) \Lambda] + (s - 1) \operatorname{tr}[B(s) (S_{0} - S_{1}) B(s) \Lambda] \\ & + s \operatorname{tr}[B(s) S_{0} B(s) (S_{0} - S_{1}) B(s) \Lambda] - \tfrac{1}{2} \operatorname{tr}\{[B(s) (S_{0} - S_{1})]^{2}\} \\ \overset{(b)}{=} {} & - \frac{1}{1 - s} \operatorname{tr}[B(s) S_{0} B(s) \Lambda] + \frac{s}{1 - s} \operatorname{tr}[B(s) S_{0} B(s) S_{0} B(s) \Lambda] - \tfrac{1}{2} \operatorname{tr}\{[B(s) (S_{0} - S_{1})]^{2}\} \\ \overset{(c)}{=} {} & - \frac{1}{1 - s} \operatorname{tr}\{\Lambda^{\frac{1}{2}} B(s) S_{0} E \, \Gamma(s) \, E^{T} S_{0} B(s) \Lambda^{\frac{1}{2}}\} - \tfrac{1}{2} \operatorname{tr}\{[B(s) (S_{0} - S_{1})]^{2}\}, \end{aligned} \tag{13}$$

where $\Gamma(s) = (E^{T} S_{0} E)^{-1} - \left(E^{T} S_{0} E + \frac{1 - s}{s} E^{T} S_{1} E\right)^{-1}$. Equalities $(a)$ and $(b)$ both result from $B(s) = B(s) [s S_{0} + (1 - s) S_{1}] B(s)$, and thus $(1 - s) B(s) (S_{0} - S_{1}) B(s) = B(s) S_{0} B(s) - B(s)$, while equality $(c)$ follows from the definition of $B(s)$ and the relationship $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ for matrices $A$ and $B$.

If $E^{T} m_{0} \neq E^{T} m_{1}$, then we have $E^{T} \Lambda^{\frac{1}{2}} \neq 0$. Because $E^{T} S_{0} E \succ 0$, it holds that

$$E^{T} S_{0} B(s) \Lambda^{\frac{1}{2}} = E^{T} S_{0} E \, [A(s)]^{-1} E^{T} \Lambda^{\frac{1}{2}} \neq 0$$

with $A(s) = E^{T} (s S_{0} + (1 - s) S_{1}) E \succ 0$. Combined with $\Gamma(s) \succ 0$, $B(s) \succeq 0$, and $\operatorname{tr}\{[B(s) (S_{0} - S_{1})]^{2}\} = \operatorname{tr}\{B(s) [(S_{0} - S_{1}) B(s) (S_{0} - S_{1})]\} \geqslant 0$, we conclude $\nabla_{s}^{2} f(E, s, m_{0}, m_{1}) < 0$ from Equation (13).

On the other hand, when $E^{T} m_{0} = E^{T} m_{1}$, it follows from Equation (3) that $E^{T} S_{0} E \neq E^{T} S_{1} E$. If $[B(s) (S_{0} - S_{1})]^{2} = 0$, then we have $B(s) (S_{0} - S_{1}) B(s) = 0$, that is,

$$E [A(s)]^{-1} (E^{T} S_{0} E - E^{T} S_{1} E) [A(s)]^{-1} E^{T} = 0.$$

It then immediately follows that $E^{T} S_{0} E - E^{T} S_{1} E = 0$, which leads to a contradiction. Therefore, when $E^{T} m_{0} = E^{T} m_{1}$, we deduce $\operatorname{tr}\{[B(s) (S_{0} - S_{1})]^{2}\} > 0$, which means $\nabla_{s}^{2} f(E, s, m_{0}, m_{1}) < 0$ by Equation (13).

As a consequence, $f(E, s, m_{0}, m_{1})$ is a strictly concave function of $s$. □
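Proposition 1 can be sanity-checked numerically on a random instance. The following Python sketch evaluates $f$ from Equation (6) on a uniform grid of $s$ values and returns the discrete second differences, which should all be negative for a strictly concave function. The helper names and the random test data are assumptions for illustration, and such a check is of course not a substitute for the proof.

```python
import numpy as np

def f_val(E, s, m0, m1, S0, S1):
    """f(E, s, m0, m1) as written in Equation (6)."""
    d = E.T @ (m1 - m0)
    A0, A1 = E.T @ S0 @ E, E.T @ S1 @ E
    A = s * A0 + (1.0 - s) * A1
    ld = lambda M: np.linalg.slogdet(M)[1]
    return (0.5 * s * (1.0 - s) * (d @ np.linalg.solve(A, d))
            - 0.5 * (s * ld(A0) + (1.0 - s) * ld(A1) - ld(A)))

def random_spd(rng, n):
    """Random symmetric positive definite matrix (hypothetical test data)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

def second_differences(E, m0, m1, S0, S1, num=101):
    """Discrete second differences of s -> f on a uniform grid over [0, 1]."""
    grid = np.linspace(0.0, 1.0, num)
    vals = np.array([f_val(E, s, m0, m1, S0, S1) for s in grid])
    return vals[:-2] - 2.0 * vals[1:-1] + vals[2:]
```

A usage example is to draw `S0` and `S1` with `random_spd`, take a column-orthogonal `E` from a QR factorization, and confirm that every entry returned by `second_differences` is negative.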

Moreover, the following Proposition 2 shows that $f_{C}(E, m_{0}, m_{1})$ is continuously differentiable with respect to $E$ for given $(m_{0}, m_{1})$, which is also necessary for the follow-up analysis.

Proposition 2.

Given $(m_{0}, m_{1})$, $f_{C}(E, m_{0}, m_{1})$ defined by (5) is a continuously differentiable function of $E$.

Proof.

For given $E$ and $(m_{0}, m_{1})$, denote $s^{*} := \arg\max_{s \in [0, 1]} f(E, s, m_{0}, m_{1})$ with $f(E, s, m_{0}, m_{1})$ given by Equation (6). Note that $f(E, 0, m_{0}, m_{1}) = f(E, 1, m_{0}, m_{1}) = 0$ and $f_{C}(E, m_{0}, m_{1}) = \max_{s \in [0, 1]} f(E, s, m_{0}, m_{1}) > 0$ under condition (3). Hence, $s^{*} \in (0, 1)$, which, combined with the strict concavity of $f(E, s, m_{0}, m_{1})$ in Proposition 1, implies that $s^{*}$ is unique and satisfies

$$\nabla_{s} f(E, s, m_{0}, m_{1}) = 0.$$

Recalling the expression of $\nabla_{s} f(E, s, m_{0}, m_{1})$ in (12), since all the involved terms $B(s)$, $\operatorname{tr}(\cdot)$, $|\cdot|$, $\log(\cdot)$ and $E^{T} S_{0} E \neq 0$ are continuously differentiable functions of $E$, $\nabla_{s} f(E, s, m_{0}, m_{1})$, as a composite function of the above functions, is also continuously differentiable with respect to $E$. Similarly, $\nabla_{s}^{2} f(E, s, m_{0}, m_{1})$ in (13) is a composition of the functions $\frac{1}{1 - s}$, $B(s)$ and $\operatorname{tr}(\cdot)$, which are all continuous with respect to $s$. Hence, $\nabla_{s}^{2} f(E, s, m_{0}, m_{1})$ is continuous with respect to $s$, that is, $\nabla_{s} f(E, s, m_{0}, m_{1})$ is continuously differentiable with respect to $s$. Since $\nabla_{s}^{2} f(E, s, m_{0}, m_{1}) \neq 0$, it follows from the implicit function theorem [23] that $s^{*}$ is an implicit function of $E$. Hence, we rewrite $s^{*}$ as

$$s^{*}(E) := \arg\max_{s \in [0, 1]} f(E, s, m_{0}, m_{1}). \tag{14}$$

Furthermore, based on the implicit function theorem [23], $s^{*}(E)$ is a continuously differentiable function of $E$, which means that $\nabla s^{*}(E)$ is continuous with respect to $E$.

Because $f_{C}(E, m_{0}, m_{1}) = \max_{s \in [0, 1]} f(E, s, m_{0}, m_{1})$, we immediately have $f_{C}(E, m_{0}, m_{1}) = f(E, s^{*}(E), m_{0}, m_{1})$. In addition, $f(E, s, m_{0}, m_{1})$ is continuously differentiable with respect to $s$ and $E$. Combined with the fact that $s^{*}(E)$ is a continuously differentiable function of $E$, it follows that $f_{C}(E, m_{0}, m_{1})$ is differentiable with respect to $E$.

Moreover, applying the chain rule [24] to $f_{C}(E, m_{0}, m_{1}) = f(E, s^{*}(E), m_{0}, m_{1})$, it holds that

$$\nabla_{E} f_{C}(E, m_{0}, m_{1}) = \left[\nabla_{E} f(E, s, m_{0}, m_{1}) + \nabla_{s} f(E, s, m_{0}, m_{1}) \nabla s^{*}(E)\right] \Big|_{s = s^{*}(E)},$$

where

$$\begin{aligned} \nabla_{E} f(E, s, m_{0}, m_{1}) = {} & s (1 - s) [I - W(s) E A^{-1}(s) E^{T}] \Lambda E A^{-1}(s) \\ & - s S_{0} E (E^{T} S_{0} E)^{-1} - (1 - s) S_{1} E (E^{T} S_{1} E)^{-1} + W(s) E A^{-1}(s) \end{aligned} \tag{15}$$

with $W(s) = s S_{0} + (1 - s) S_{1}$, $A(s) = E^{T} W(s) E$, and $\Lambda = (m_{1} - m_{0}) (m_{1} - m_{0})^{T}$. Since $\nabla s^{*}(E)$, $\nabla_{s} f(E, s, m_{0}, m_{1})$ and $\nabla_{E} f(E, s, m_{0}, m_{1})$ are all continuous with respect to $E$, $\nabla_{E} f_{C}(E, m_{0}, m_{1})$ is also continuous. In conclusion, $f_{C}(E, m_{0}, m_{1})$ is a continuously differentiable function of $E$. □
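The closed-form gradient in Equation (15) can be compared against central finite differences of Equation (6). The Python sketch below is one way to do this; the function names, the step size `h`, and the random test instance are assumptions for illustration. The comparison is made at a fixed $s$, so it checks only $\nabla_{E} f$ and not the dependence of $s^{*}$ on $E$.

```python
import numpy as np

def f_val(E, s, m0, m1, S0, S1):
    """f(E, s, m0, m1) as written in Equation (6)."""
    d = E.T @ (m1 - m0)
    A0, A1 = E.T @ S0 @ E, E.T @ S1 @ E
    A = s * A0 + (1.0 - s) * A1
    ld = lambda M: np.linalg.slogdet(M)[1]
    return (0.5 * s * (1.0 - s) * (d @ np.linalg.solve(A, d))
            - 0.5 * (s * ld(A0) + (1.0 - s) * ld(A1) - ld(A)))

def grad_E_f(E, s, m0, m1, S0, S1):
    """Gradient of f with respect to E, transcribed from Equation (15)."""
    Lam = np.outer(m1 - m0, m1 - m0)
    W = s * S0 + (1.0 - s) * S1
    A_inv = np.linalg.inv(E.T @ W @ E)
    proj = np.eye(E.shape[0]) - W @ E @ A_inv @ E.T
    return (s * (1.0 - s) * proj @ Lam @ E @ A_inv
            - s * S0 @ E @ np.linalg.inv(E.T @ S0 @ E)
            - (1.0 - s) * S1 @ E @ np.linalg.inv(E.T @ S1 @ E)
            + W @ E @ A_inv)

def numerical_grad(E, s, m0, m1, S0, S1, h=1e-6):
    """Entry-wise central finite-difference approximation of the same gradient."""
    G = np.zeros_like(E)
    for i in range(E.shape[0]):
        for j in range(E.shape[1]):
            P = np.zeros_like(E)
            P[i, j] = h
            G[i, j] = (f_val(E + P, s, m0, m1, S0, S1)
                       - f_val(E - P, s, m0, m1, S0, S1)) / (2.0 * h)
    return G
```

Equation (15) is derived without using $E^{T} E = I$, so the comparison can be run on any full-column-rank $E$, not only on column-orthogonal matrices.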

4.1. Compute the Gradient in Problem (RP) by Danskin’s Theorem

In the sequel, we exploit Danskin’s theorem in Appendix A to compute the gradient $\nabla \tilde{f}_{C}(E)$, where $\tilde{f}_{C}(E)$ is defined by Equation (9).

On the basis of Proposition 2, the following results hold true for the function $f_{C}(E, m_{0}, m_{1})$ and the set $\mathcal{E}(\bar{m}_{0}, k_{0} S_{0}^{-1}) \times \mathcal{E}(\bar{m}_{1}, k_{1} S_{1}^{-1})$:

1) $\mathcal{E}(\bar{m}_{0}, k_{0} S_{0}^{-1}) \times \mathcal{E}(\bar{m}_{1}, k_{1} S_{1}^{-1})$ is a compact set because it is a bounded closed set in a finite-dimensional space. Meanwhile, for all $(m_{0}, m_{1})$, the mapping $(t, m_{0}, m_{1}) \to f_{C}(E + t h, m_{0}, m_{1})$ is continuous at the point $(0, m_{0}, m_{1})$ due to the continuity of $f_{C}(E, m_{0}, m_{1})$;

2) For arbitrary given $(m_{0}, m_{1})$ and sufficiently small $t > 0$, since $f_{C}(E, m_{0}, m_{1})$ is differentiable, there exists a bounded directional derivative

$$D_{1} f_{C}(E + t h, m_{0}, m_{1}; h) = \lim_{\tau \to 0^{+}} \frac{1}{\tau} \left[f_{C}(E + (t + \tau) h, m_{0}, m_{1}) - f_{C}(E + t h, m_{0}, m_{1})\right];$$

3) The mapping $(t, m_{0}, m_{1}) \to D_{1} f_{C}(E + t h, m_{0}, m_{1}; h)$ is continuous at the point $(0, m_{0}, m_{1})$, which follows from the continuity of $\nabla_{E} f_{C}(E, m_{0}, m_{1})$.

With identifications

E \sim u

,

(m_{0}, m_{1}) \sim v

,

R^{m \times p} \sim U

,

E ({\bar{m}}_{0}, k_{0} {S_{0}}^{- 1}) \times E ({\bar{m}}_{1}, k_{1} {S_{1}}^{- 1}) \sim V

,

- f_{C} (E, m_{0}, m_{1}) \sim J (u, v)

, and

- {\tilde{f}}_{C} (E) \sim \bar{J} (u)

, all conditions of Danskin’s theorem in Appendix A are satisfied. Subsequently, for a given selection matrix

E

, if the optimal solution

({\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

of problem (10) is unique, then the gradient of

{\tilde{f}}_{C} (E)

exists. Furthermore, by Danskin’s theorem, we have

\nabla {\tilde{f}}_{C} (E) = \nabla_{E} f_{C} (E, m_{0}, m_{1}) |_{(m_{0}, m_{1}) = ({\hat{m}}_{0} (E), {\hat{m}}_{1} (E))} .

(16)

Recall

f_{C} (E, m_{0}, m_{1}) = max_{s \in [0, 1]} f (E, s, m_{0}, m_{1})

from Equation (5). For given

(m_{0}, m_{1})

, we can again apply Danskin’s theorem in Appendix A to obtain

\nabla_{E} f_{C} (E, m_{0}, m_{1})

. Obviously,

f (E, s, m_{0}, m_{1})

is continuously differentiable with respect to

E

. Therefore, it is easy to verify that conditions (1)–(3) of Danskin’s theorem in Appendix A are all satisfied with the identifications

E \sim u

,

s \sim v

,

R^{m \times p} \sim U

,

[0, 1] \sim V

,

f (E, s, m_{0}, m_{1}) \sim J (u, v)

, and

f_{C} (E, m_{0}, m_{1}) \sim \bar{J} (u)

. Combined with the uniqueness of

s^{*} (E)

defined by Equation (14), we can deduce

\nabla_{E} f_{C} (E, m_{0}, m_{1}) = \nabla_{E} f (E, s, m_{0}, m_{1}) |_{s = s^{*} (E)} .

(17)

On the other hand, due to the equivalence of Problems (7) and (8), for a given matrix

E

, a solution

({\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

of Problem (10) corresponds to a solution

(\hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

of problem

Inner Problem (IP): min_{(m_{0}, m_{1}) \in E ({\bar{m}}_{0}, k_{0} {S_{0}}^{- 1}) \times E ({\bar{m}}_{1}, k_{1} {S_{1}}^{- 1})} max_{s \in [0, 1]} f (E, s, m_{0}, m_{1}),

(IP)

satisfying

f_{C} (E, {\hat{m}}_{0} (E), {\hat{m}}_{1} (E)) = f (E, \hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

and

\hat{s} (E) = arg max_{s \in [0, 1]} f (E, s, {\hat{m}}_{0} (E), {\hat{m}}_{1} (E)) .

(18)

Combined with Equations (16) and (17), when the solution

(\hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

is unique, we have

\begin{matrix} \nabla {\tilde{f}}_{C} (E) = \nabla_{E} f (E, s, m_{0}, m_{1}) |_{(s, m_{0}, m_{1}) = (\hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))}, \end{matrix}

(19)

where

\nabla_{E} f (E, s, m_{0}, m_{1})

is given by Equation (15) and

(\hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

is the optimal solution of problem (IP).

4.2. Compute the Optimal Solution of Problem (IP)

Owing to Sion’s minimax theorem [25] and Lemma 5 in [11], for a given

E

, we can exchange the order of the minimization with respect to

(m_{0}, m_{1})

and the maximization with respect to s in problem (IP). Correspondingly, problem (IP) can be equivalently converted into

\begin{matrix} max_{s \in [0, 1]} min_{(m_{0}, m_{1}) \in E ({\bar{m}}_{0}, k_{0} {S_{0}}^{- 1}) \times E ({\bar{m}}_{1}, k_{1} {S_{1}}^{- 1})} f (E, s, m_{0}, m_{1}) . \end{matrix}

(20)

Therefore, we will solve Problem (20) to attain the solution

(\hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

of Problem (IP).

First, for given matrix

E

and parameter s, denote

(m_{0}^{*} (s), m_{1}^{*} (s)) : = arg min_{(m_{0}, m_{1}) \in E ({\bar{m}}_{0}, k_{0} {S_{0}}^{- 1}) \times E ({\bar{m}}_{1}, k_{1} {S_{1}}^{- 1})} f (E, s, m_{0}, m_{1}) .

Using the expression of

f (E, s, m_{0}, m_{1})

in Equation (6) and removing the terms that do not depend on

(m_{0}, m_{1})

, we can obtain

(m_{0}^{*} (s), m_{1}^{*} (s))

by solving the following subproblem

\begin{matrix} min_{(m_{0}, m_{1}) \in E ({\bar{m}}_{0}, k_{0} {S_{0}}^{- 1}) \times E ({\bar{m}}_{1}, k_{1} {S_{1}}^{- 1})} {(m_{1} - m_{0})}^{T} B (m_{1} - m_{0}) \end{matrix}

(21)

with

B : = E {(E^{T} (s S_{0} + (1 - s) S_{1}) E)}^{- 1} E^{T} ⪰ 0

. Clearly, Problem (21) is convex and can be solved directly with the CVX toolbox [26], with computational complexity

O (p^{3})

[27]. Thus, once s is fixed, the corresponding

(m_{0}^{*} (s), m_{1}^{*} (s))

follows. Particularly,

({\hat{m}}_{0} (E), {\hat{m}}_{1} (E)) = (m_{0}^{*} (\hat{s} (E)), m_{1}^{*} (\hat{s} (E)))

.
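To illustrate how Problem (21) can be solved numerically, the following sketch replaces the CVX toolbox used in the paper with a plain projected-gradient iteration. It assumes that each uncertainty set can be written as {c_i + L_i u : ||u||_2 <= 1} for a center c_i and a matrix L_i; how L_i relates to k_i and S_i depends on the ellipsoid definition given earlier in the paper, which is not reproduced here. The function names `build_B` and `solve_subproblem_21` are hypothetical.

```python
import numpy as np


def build_B(E, s, S0, S1):
    """B := E (E^T (s S0 + (1 - s) S1) E)^{-1} E^T, as defined after Problem (21)."""
    W = s * S0 + (1 - s) * S1
    return E @ np.linalg.inv(E.T @ W @ E) @ E.T


def project_unit_ball(u):
    """Euclidean projection onto the unit ball."""
    n = np.linalg.norm(u)
    return u if n <= 1.0 else u / n


def solve_subproblem_21(B, c0, L0, c1, L1, iters=2000):
    """Projected gradient for min (m1 - m0)^T B (m1 - m0),
    with m_i = c_i + L_i u_i and ||u_i|| <= 1 (ASSUMED parametrization)."""
    u0 = np.zeros(L0.shape[1])
    u1 = np.zeros(L1.shape[1])
    # m1 - m0 = (c1 - c0) + M @ [u0; u1]; the Hessian in (u0, u1) is 2 M^T B M.
    M = np.hstack([-L0, L1])
    lipschitz = 2.0 * np.linalg.norm(M.T @ B @ M, 2)
    step = 1.0 / max(lipschitz, 1e-12)
    for _ in range(iters):
        d = (c1 + L1 @ u1) - (c0 + L0 @ u0)
        g = 2.0 * B @ d                                  # gradient w.r.t. d
        u0 = project_unit_ball(u0 + step * (L0.T @ g))   # d depends on -L0 u0
        u1 = project_unit_ball(u1 - step * (L1.T @ g))
    m0 = c0 + L0 @ u0
    m1 = c1 + L1 @ u1
    d = m1 - m0
    return m0, m1, float(d @ B @ d)


# Toy check: two unit disks centered at (0, 0) and (4, 0), with B = I.
# The closest points are (1, 0) and (3, 0).
I2 = np.eye(2)
m0_star, m1_star, value = solve_subproblem_21(
    I2, np.zeros(2), I2, np.array([4.0, 0.0]), I2)
print(m0_star, m1_star, value)
```

The step size 1/L, with L the largest eigenvalue of the Hessian, is a standard choice for projected gradient on a convex quadratic; an interior-point solver such as the one behind CVX would normally be preferred for accuracy.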

Next, we will show how to determine the optimal

\hat{s} (E)

given by Equation (18). Based on Proposition 1,

f (E, s, m_{0}, m_{1})

is concave with respect to s. Combined with the fact that

f (E, s, m_{0}^{*} (s), m_{1}^{*} (s))

is the minimum of a family of functions

f (E, s, m_{0}, m_{1})

over the uncertainty set

E ({\bar{m}}_{0}, k_{0} {S_{0}}^{- 1}) \times E ({\bar{m}}_{1}, k_{1} {S_{1}}^{- 1})

, we conclude that

f (E, s, m_{0}^{*} (s), m_{1}^{*} (s))

is also a concave function of s [28]. Therefore,

\nabla_{s} f (E, s, m_{0}^{*} (s), m_{1}^{*} (s))

is monotonically decreasing with respect to s. Subsequently, we apply the bisection method to search for

\hat{s} (E)

such that

\nabla_{s} f (E, \hat{s} (E), m_{0}^{*} (\hat{s} (E)), m_{1}^{*} (\hat{s} (E))) = 0 .

For given

E

, we can once again use Danskin’s theorem in Appendix A to derive

\nabla_{s} f (E, s, m_{0}^{*} (s), m_{1}^{*} (s))

. Since

f (E, s, m_{0}, m_{1})

is continuously differentiable with respect to s, all conditions of Danskin’s theorem in Appendix A are met. Meanwhile, even if the optimal solution

(m_{0}^{*} (s), m_{1}^{*} (s))

of Problem (21) is not unique,

\nabla_{s} f (E, s, m_{0}^{*} (s), m_{1}^{*} (s))

in Equation (12) is always the same no matter which

(m_{0}^{*} (s), m_{1}^{*} (s))

is substituted. Therefore,

f (E, s, m_{0}^{*} (s), m_{1}^{*} (s))

is differentiable with respect to s, and we have

\begin{matrix} \nabla_{s} f (E, s, m_{0}^{*} (s), m_{1}^{*} (s)) = \nabla_{s} f (E, s, m_{0}, m_{1}) |_{(m_{0}, m_{1}) = (m_{0}^{*} (s), m_{1}^{*} (s))}, \end{matrix}

(22)

where

\nabla_{s} f (E, s, m_{0}, m_{1})

is given by Equation (12), and

(m_{0}^{*} (s), m_{1}^{*} (s))

is an arbitrary optimal solution of Problem (21).

After we get the optimal

\hat{s} (E)

, the corresponding

(m_{0}^{*} (\hat{s}), m_{1}^{*} (\hat{s}))

, i.e.,

({\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

, is acquired by solving Problem (21). That is, the optimal solution

(\hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

of Problem (IP) is obtained. We summarize the process of solving

(\hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

as the following Inner Procedure 1.

Inner Procedure 1: Computing the Optimal Solution of Problem (IP)
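The steps described above (solve Problem (21) for a fixed s, evaluate the derivative in Equation (22), and bisect on s) can be sketched as follows. This is only an illustrative outline under stated assumptions, not the authors' procedure: the callables `solve_min` and `grad_s` are hypothetical placeholders for a solver of Problem (21) and for Equation (12), and the boundary checks at s = 0 and s = 1 are an added safeguard.

```python
def inner_procedure(solve_min, grad_s, tol=1e-10, max_iter=200):
    """Bisection on s in [0, 1] for the root of the decreasing map
    s -> grad_s(s, m0*(s), m1*(s)).

    solve_min(s)       -> (m0, m1), a minimizer of Problem (21) for fixed s
    grad_s(s, m0, m1)  -> derivative of f with respect to s at (s, m0, m1)
    """
    lo, hi = 0.0, 1.0
    # Safeguard (assumption): if the derivative does not change sign on [0, 1],
    # the maximizer of the concave function lies at an endpoint.
    m0, m1 = solve_min(lo)
    if grad_s(lo, m0, m1) <= 0.0:
        return lo, m0, m1
    m0, m1 = solve_min(hi)
    if grad_s(hi, m0, m1) >= 0.0:
        return hi, m0, m1
    mid = 0.5
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        m0, m1 = solve_min(mid)
        g = grad_s(mid, m0, m1)
        if abs(g) < tol or hi - lo < tol:
            break
        if g > 0.0:
            lo = mid      # f is still increasing: the root is to the right
        else:
            hi = mid      # f is decreasing: the root is to the left
    return mid, m0, m1


# Toy scalar example (hypothetical): unit variances under both hypotheses, so
# f(s, m0, m1) = 0.5 * s * (1 - s) * (m1 - m0)**2, and the uncertainty sets are
# the intervals [-1, 1] and [4, 6]. The closest pair is (1, 4) for every s.
def toy_solve_min(s):
    return 1.0, 4.0


def toy_grad_s(s, m0, m1):
    return 0.5 * (1.0 - 2.0 * s) * (m1 - m0) ** 2


s_hat, m0_hat, m1_hat = inner_procedure(toy_solve_min, toy_grad_s)
print(s_hat, m0_hat, m1_hat)
```

In the toy example the derivative vanishes at s = 1/2, which is the usual maximizer of the Chernoff exponent for two Gaussians with equal variance.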

Remark 3.

If the true distribution under each hypothesis is exactly known, i.e., there is no uncertainty in the mean vector, then we omit the process of computing solution

(m_{0}^{*} (\hat{s}), m_{1}^{*} (\hat{s}))

of problem (21), and directly regard the true mean vector as

(m_{0}^{*} (\hat{s}), m_{1}^{*} (\hat{s}))

. Consequently, the robust sensor selection problem is reduced to one without uncertainty.

Based on the previous discussion, if the optimal solution

(\hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

of Problem (IP) is unique, then the gradient

\nabla {\tilde{f}}_{C} (E)

exists and can be computed by Equation (19). Next we will discuss the existence of the gradient

\nabla {\tilde{f}}_{C} (E)

in detail.

4.3. Existence of the Gradient in Problem (RP)

It has been shown that the uniqueness of the optimal solution

(\hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

leads to the existence of the gradient

\nabla {\tilde{f}}_{C} (E)

. Therefore, in the following, we establish when

(\hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

is unique. First, the following lemma establishes the uniqueness of

\hat{s} (E)

.

Lemma 1.

For given

E

, let

(\hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

be the optimal solution of Problem (IP). Then

\hat{s} (E)

is unique.

Proof.

We argue by contradiction. Assume that

(\tilde{s} (E), {\tilde{m}}_{0} (E), {\tilde{m}}_{1} (E))

is also an optimal solution of Problem (IP) with

\tilde{s} (E) \neq \hat{s} (E)

.

Note that both

(\hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

and

(\tilde{s} (E), {\tilde{m}}_{0} (E), {\tilde{m}}_{1} (E))

are saddle points of Problem (IP). According to Proposition 1.4 in [29], the set of saddle points has the form of a Cartesian product. Hence,

(\tilde{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

is also a saddle point. That is, for given

({\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

, both

\hat{s} (E)

and

\tilde{s} (E)

are optimal solutions of the problem

max_{s \in [0, 1]} f (E, s, {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

, which contradicts the strict concavity of

f (E, s, m_{0}, m_{1})

with respect to s. Consequently,

\hat{s} (E)

is unique. □

With the uniqueness of

\hat{s} (E)

, we additionally make the following assumption to guarantee the existence of

\nabla {\tilde{f}}_{C} (E)

.

Assumption 1.

For given matrix

E

, the saddle point

(\hat{s} (E), m_{0}^{*} (\hat{s} (E)), m_{1}^{*} (\hat{s} (E)))

of Problem (IP) is unique.

Remark 4.

Because Lemma 1 has proven that

\hat{s} (E)

is unique, if the solution

(m_{0}^{*} (\hat{s} (E)), m_{1}^{*} (\hat{s} (E)))

of Problem (21) is unique, then Assumption 1 is satisfied. Under some conditions (e.g., when the distance between the two uncertainty sets

E ({\bar{m}}_{0}, k_{0} {S_{0}}^{- 1})

and

E ({\bar{m}}_{1}, k_{1} {S_{1}}^{- 1})

is large),

(m_{0}^{*} (\hat{s} (E)), m_{1}^{*} (\hat{s} (E)))

of Problem (21) is naturally unique. Therefore, Assumption 1 is not very restrictive.

Under Assumption 1,

(\hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

is unique and achievable by Inner Procedure 1. Hence,

\nabla {\tilde{f}}_{C} (E)

exists and can be obtained by Equation (19). Subsequently, Algorithm 1 can be executed. Furthermore, on the basis of Theorem 1 in Section 3.1, we conclude that

E^{*}

obtained by the OCPGA in Algorithm 1 is a stationary point of problem (RP), which lays a foundation to attain a better solution of the original problem (PP).

Remark 5.

If Assumption 1 is not satisfied, we could use the Clarke generalized gradient [30] in place of the gradient in Algorithm 1. Based on Danskin’s theorem in Appendix A, the Clarke generalized gradient can also be obtained from Equation (19), where an arbitrary solution of Problem (21) is used. Algorithm 1 then remains applicable and preserves the orthogonality constraint in each iteration. Although the resulting solution is not guaranteed to be a stationary point of Problem (RP), numerical simulations illustrate that, after the projection and refinement phases, the performance of the final result is still acceptable.

5. Numerical Simulations

In this section, numerical examples are carried out to show that the OCPGA-based method can obtain better solutions than the greedy algorithm-based method in [11]. To this end, (1) for fixed-size sensor selection problems, i.e., problems in which the total number and the selected number of sensors are fixed, we generate 50 (or 20) random cases with different uncertainty sets and report the proportions of cases in which the OCPGA-based method performs better than, the same as, and worse than the greedy algorithm-based method; (2) for small-scale sensor selection problems, we compare the OCPGA-based method with the greedy algorithm-based method and the exhaustive method; (3) for larger-scale sensor selection problems, we compare the OCPGA-based method with the greedy algorithm-based method; (4) for specific small-scale and larger-scale sensor selection problems, we depict the corresponding receiver operating characteristic (ROC) curves. Notably, in cases (1)–(3), the performance of a method is measured by the resulting Chernoff distance (i.e.,

{\tilde{f}}_{C} (E)

in Problem (PP)); that is, the method with the larger Chernoff distance performs better. All procedures are coded in MATLAB R2014b on an ASUS notebook with an Intel(R) Core(TM) i3-2310M CPU at 2.10 GHz and 6 GB of memory.

Assume that we need to choose p out of m sensors. In all simulations, for given

(m, p)

, the ellipsoidal uncertainty sets

E ({\bar{m}}_{i}, k_{i} {S_{i}}^{- 1})

in Problem (IP),

i = 0, 1

, which contain the true distribution under each hypothesis, are generated as follows. Elements of the estimated mean vectors

{\bar{m}}_{0}

and

{\bar{m}}_{1}

are randomly generated from

(0, 1)

and

(0, 2)

, respectively. The covariance matrix

S_{i}

is generated by

S_{i} = P_{i} Σ_{i} {P_{i}}^{T}

, where

P_{i}

is an orthogonal basis of

m \times m

-dimensional matrices whose elements are generally drawn from

(0, 1)

, and

Σ_{i}

is a diagonal matrix with diagonal entries randomly generated in

(0, 2)

,

i = 0, 1

. The robustness parameters are set to

k_{0} = k_{1} = 1

.
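The data generation described above can be sketched as follows. The text only gives the intervals (0, 1) and (0, 2), so drawing uniformly from them is an assumption, as is obtaining the orthogonal matrix P_i from a QR factorization of a random matrix with entries in (0, 1); the function name `generate_problem` is hypothetical and the original experiments were run in MATLAB.

```python
import numpy as np


def generate_problem(m, rng, k0=1.0, k1=1.0):
    """Random test instance: estimated means, covariances, robustness parameters."""
    # ASSUMPTION: "randomly generated from (0, 1)" is read as uniform sampling.
    m0_bar = rng.uniform(0.0, 1.0, size=m)
    m1_bar = rng.uniform(0.0, 2.0, size=m)
    covariances = []
    for _ in range(2):
        # ASSUMPTION: P_i is the orthogonal factor of a random m x m matrix.
        P, _r = np.linalg.qr(rng.uniform(0.0, 1.0, size=(m, m)))
        sigma = rng.uniform(0.0, 2.0, size=m)        # diagonal of Sigma_i
        covariances.append(P @ np.diag(sigma) @ P.T)  # S_i = P_i Sigma_i P_i^T
    S0, S1 = covariances
    return m0_bar, m1_bar, S0, S1, k0, k1


rng = np.random.default_rng(2020)
m0_bar, m1_bar, S0, S1, k0, k1 = generate_problem(10, rng)
print(np.linalg.eigvalsh(S0).min(), np.linalg.eigvalsh(S1).min())
```

Because the diagonal entries of Σ_i are positive, each S_i is symmetric positive definite, although it can be ill-conditioned when a diagonal entry is drawn close to zero.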

When

(m, p)

and the ellipsoidal uncertainty sets

E ({\bar{m}}_{i}, k_{i} {S_{i}}^{- 1})

are given, we adopt Inner Procedure 1 to compute

(\hat{s} (E), {\hat{m}}_{0} (E), {\hat{m}}_{1} (E))

, use Equation (19) to obtain the gradient

\nabla {\tilde{f}}_{C} (E)

, and then apply the OCPGA in Algorithm 1 to get the stationary point

E^{*}

of Problem (RP). After the projection and refinement phases described in Section 3.2, the final solution

\hat{E}

of Problem (PP) is obtained. Meanwhile, the greedy algorithm-based method is deterministic; that is, for given

(m, p)

and ellipsoidal uncertainty sets, its output is the same no matter how many times it is run. In contrast, since the initial point of the OCPGA in Algorithm 1 is randomly chosen, the output of the OCPGA-based method may vary with the initial point. If time permits, the OCPGA-based method can be run several times to achieve better performance. Moreover, the OCPGA-based and the greedy algorithm-based methods both execute only one refinement.

Fixed-Size Examples: For 8 fixed pairs of small

(m, p)

, we give the proportions that the OCPGA-based method performs better than, the same as, and worse than the greedy algorithm-based method. For each

(m, p)

, by implementing the two methods on 50 randomly generated ellipsoidal uncertainty sets (only 20 for

(50, 5)

,

(80, 5)

, and

(100, 5)

due to their long runtimes), we obtain the results listed in Table 1. Here, for each ellipsoidal uncertainty set, the OCPGA-based method is run twice and the better result is taken as the output. As shown in Table 1, for each

(m, p)

, the OCPGA-based method performs as well as the greedy algorithm-based method in most cases, while the “better” proportion is at least twice the “worse” proportion. Moreover, simulations show that, in the “worse” cases, if we run the OCPGA-based method more times, it can perform as well as, or even better than, the greedy algorithm-based method.

Small-Scale Examples: We consider small-scale sensor selection problems, whose globally optimal solutions can be attained by the exhaustive method. Hence, we compare the optimal Chernoff distance obtained by the OCPGA-based method with those of the greedy algorithm-based method and the exhaustive method. Suppose that

p = 3, 4, 5

out of

m = 10, 12, 15

sensors are chosen. The corresponding outputs of the three methods are given in Table 2, and the corresponding runtimes are listed in Table 3. Table 2 shows that the Chernoff distances achieved by the OCPGA-based method are larger than those of the existing greedy algorithm-based method. For the case of

(15, 3)

, the Chernoff distance achieved by the OCPGA-based method is even

100 %

larger than that of the greedy algorithm-based method. In particular, our proposed OCPGA-based method attains the same performance as the exhaustive method. Meanwhile, Table 3 shows that the OCPGA-based method is more efficient than the greedy algorithm-based method, and that both have much shorter runtimes than the exhaustive method, which is consistent with the theoretical computational complexity analyses. A simple computation on Table 3 shows that the runtime of the OCPGA-based method can be up to

48.72 %

shorter than that of the greedy algorithm-based method (for the case of

(10, 4)

). Therefore, our proposed OCPGA-based method admits better performance in terms of not only the objective value but also the runtime.

Larger-Scale Examples: Now we consider larger-scale problems, where

m = 50, 80, 100

and

p = 5, 10, 15

. Since m and p are large, the exhaustive method is computationally infeasible, so we only compare the Chernoff distances obtained by our OCPGA-based method with those of the greedy algorithm-based method. The corresponding results are exhibited in Table 4, while the corresponding runtimes of the two methods are displayed in Table 5. As we can see from Table 4, the OCPGA-based method can attain a better solution than the greedy algorithm-based method. In the case of

(50, 5)

, the Chernoff distance obtained by the OCPGA-based method can be

13.14 %

larger than that of the greedy algorithm-based method. Moreover, Table 5 shows that the OCPGA-based method is more efficient than the greedy algorithm-based method, with a runtime reduction of up to

42.19 %

in the case of

(50, 5)

. Compared with the small-scale cases in Table 3, Table 5 shows a more pronounced improvement in runtime, which is due to the larger gap between m and p in the larger-scale cases. Hence, for larger-scale cases, the OCPGA-based method can also achieve better solutions than the greedy algorithm-based method with shorter runtimes.
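The percentages quoted in the small-scale and larger-scale examples can be re-derived from Tables 2–5 with a few lines of arithmetic. The sketch below copies the greedy and OCPGA entries from the tables; the helper names are hypothetical.

```python
def relative_gain(new, old):
    """Percentage by which `new` exceeds `old`."""
    return 100.0 * (new - old) / old


def relative_reduction(new, old):
    """Percentage by which `new` is smaller than `old`."""
    return 100.0 * (old - new) / old


# (greedy runtime, OCPGA runtime) in seconds, keyed by (m, p): Table 3.
table3 = {
    (10, 3): (52.12, 51.07), (10, 4): (204.77, 105.01), (10, 5): (87.07, 80.39),
    (12, 3): (128.75, 109.35), (12, 4): (82.41, 75.94), (12, 5): (106.53, 98.31),
    (15, 3): (92.68, 89.28), (15, 4): (193.45, 188.95), (15, 5): (153.09, 150.41),
}
# Same layout for Table 5.
table5 = {
    (50, 5): (1499.76, 866.96), (50, 10): (2083.39, 1956.96), (50, 15): (3794.17, 3635.68),
    (80, 5): (1839.27, 1718.73), (80, 10): (5435.69, 5251.96), (80, 15): (5350.99, 5240.56),
    (100, 5): (2290.26, 1752.71), (100, 10): (5589.49, 5273.85), (100, 15): (11966.38, 11433.33),
}


def largest_reduction(table):
    """Return the (m, p) pair with the largest relative runtime reduction."""
    case = max(table, key=lambda k: relative_reduction(table[k][1], table[k][0]))
    greedy, ocpga = table[case]
    return case, relative_reduction(ocpga, greedy)


small_case, small_cut = largest_reduction(table3)
large_case, large_cut = largest_reduction(table5)
gain_15_3 = relative_gain(0.30, 0.15)   # Table 2, (m, p) = (15, 3)
gain_50_5 = relative_gain(1.55, 1.37)   # Table 4, (m, p) = (50, 5)
print(small_case, round(small_cut, 2), large_case, round(large_cut, 2))
print(round(gain_15_3, 2), round(gain_50_5, 2))
```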

ROC Curves for Specific Examples: By Monte Carlo simulations with 200,000 instantiations of the LR tests and calculating

P_{D} / P_{F}

with

P_{D} : = 1 - P_{M}

being the detection probability, we depict the ROC curves to show the validity of the OCPGA-based method. It is well known that the higher the ROC curve, the better the detection performance. For the case of

(m, p) = (10, 3)

in Table 2, we display the corresponding ROC curves of the exhaustive method, the greedy algorithm-based method and the OCPGA-based method. It can be seen from Figure 2 that the OCPGA-based method is superior to the greedy algorithm-based method, while it can attain the same performance as that of the exhaustive method. Similarly, for the case of

(m, p) = (50, 5)

in Table 4, Figure 3 illustrates that our proposed approach performs better than the existing greedy algorithm-based method.
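A Monte Carlo ROC estimate of the kind described above can be sketched as follows. The sketch assumes that the selected measurements are Gaussian with means E^T m_i and covariances E^T S_i E under hypothesis i, and that the LR test compares the log-likelihood ratio with a threshold; the function names, the threshold grid, and the toy parameters are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def log_likelihood_ratio(y, mu0, C0, mu1, C1):
    """log p1(y) - log p0(y) for each row of y (the 2*pi constants cancel)."""
    def log_density(mu, C):
        diff = y - mu
        solved = np.linalg.solve(C, diff.T).T
        return -0.5 * (np.sum(diff * solved, axis=1) + np.linalg.slogdet(C)[1])
    return log_density(mu1, C1) - log_density(mu0, C0)


def monte_carlo_roc(mu0, C0, mu1, C1, n_trials, rng, n_thresholds=200):
    """Empirical false-alarm and detection probabilities of the LR test."""
    y0 = rng.multivariate_normal(mu0, C0, size=n_trials)   # samples under H0
    y1 = rng.multivariate_normal(mu1, C1, size=n_trials)   # samples under H1
    llr0 = log_likelihood_ratio(y0, mu0, C0, mu1, C1)
    llr1 = log_likelihood_ratio(y1, mu0, C0, mu1, C1)
    # Thresholds in ascending order, spread over the observed LLR values.
    thresholds = np.quantile(np.concatenate([llr0, llr1]),
                             np.linspace(0.0, 1.0, n_thresholds))
    p_f = np.array([(llr0 > t).mean() for t in thresholds])  # false alarm
    p_d = np.array([(llr1 > t).mean() for t in thresholds])  # detection
    return p_f, p_d


# Toy usage with p = 2 selected sensors (hypothetical numbers).
rng = np.random.default_rng(1)
mu0, mu1 = np.zeros(2), np.ones(2)
C0 = C1 = np.eye(2)
p_f, p_d = monte_carlo_roc(mu0, C0, mu1, C1, n_trials=20000, rng=rng)
print(p_f[:3], p_d[:3])
```

Plotting `p_d` against `p_f` for two candidate selection matrices gives curves that can be compared in the same way as in Figures 2 and 3.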

In summary, compared with the greedy algorithm-based method, the OCPGA-based method not only admits a lower theoretical computation complexity, but also can obtain better solutions with shorter runtimes in numerical simulations.

6. Conclusions

We address minimax robust sensor selection for binary Gaussian hypothesis testing in WSNs, where the distribution mean vector under each hypothesis drifts in an ellipsoidal uncertainty set. Under a Bayesian optimal criterion, minimizing the maximum overall error probability with respect to the selection matrix is approximately converted to maximizing the minimum Chernoff distance between the distributions under the null hypothesis and the alternative hypothesis to be detected. Then, the gradient of the objective function of the converted maximization problem is computed by Danskin’s theorem. Furthermore, we apply the OCPGA to solve the relaxed maximization problem without 0/1 constraints, which yields a stationary point of the relaxed problem with lower computational complexity than the existing greedy algorithm. Numerical simulations demonstrate that the OCPGA-based method can attain better solutions than the greedy algorithm-based method with up to

48.72 %

shorter runtimes, and that the OCPGA-based method is able to attain the globally optimal solution obtained by the exhaustive method for small-scale problems. In future work, we can consider cases where the distribution mean falls in other types of uncertainty sets such as the band model. In addition, the case in which the distribution covariance is not precisely known is another direction for future research.

Author Contributions

Conceptualization and methodology, T.M., B.Q. and E.S.; software, T.M., D.N. and B.Q.; supervision, E.S.; writing–original draft, T.M.; writing–review and editing, E.S. and Q.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Sichuan Science and Technology Program under Grant 2019YJ0115, the National Natural Science Foundation of China under Grants 61473197, 11871357 and 61671411, and the National Key Research and Development Project under Grants 2018YFC0830303 and 2017YFE0119300.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

WSN	Wireless sensor network
KL	Kullback-Leibler
LR	Likelihood ratio
OCPGA	Orthogonal constraint-preserving gradient algorithm
PP	Primal Problem
RP	Relaxed Problem
IP	Inner Problem
ROC	Receiver operating characteristic

Appendix A. Danskin’s Theorem (Theorem D1 and Its Corollary [15])

Let U and V be subsets of a Banach space

\mathcal{U}

and a topological space

\mathcal{V}

respectively. J is a mapping from

U \times V

into

R

. Assume

\bar{J} (u) = sup_{v \in V} J (u, v)

, and

\hat{V} (u) = \{ v \in V \mid J (u, v) = \bar{J} (u) \}

. Moreover, suppose the following conditions hold:

(1): V is compact and, $\forall v \in V$, the mapping $(t, v) \to J (u + t h, v)$ is upper semi-continuous at $(0, v)$;
(2): $\forall v \in V$ and $\forall t$ in a right neighborhood of 0, there exists a bounded directional derivative

$\begin{matrix} D_{1} J (u + t h, v; h) = lim_{τ \to 0^{+}} \frac{1}{τ} [J (u + (t + τ) h, v) - J (u + t h, v)]; \end{matrix}$
(3): the map $(t, v) \to D_{1} J (u + t h, v)$ is upper semi-continuous at $(0, v)$ .

Then the function

\bar{J}

has a directional derivative at u in the direction h, given by the formula

D \bar{J} (u; h) = max_{v \in \hat{V} (u)} D_{1} J (u, v; h) .

Moreover, if

u \to J (u, v)

has a Gateaux derivative

J_{u}^{'}

, and if the max is unique:

\hat{V} (u) = \{ \hat{v} \}

, then

\bar{J}

has a Gateaux derivative

{\bar{J}}^{'} (u)

given by the simple formula

{\bar{J}}^{'} (u) = J_{u}^{'} (u, \hat{v}) .
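To make the last formula concrete, the following toy example (not from the paper) takes J (u, v) = u v - v^2 on V = [0, 1]. For 0 < u < 2 the maximizer is unique, \hat{v} = u / 2, so \bar{J} (u) = u^2 / 4 and the theorem predicts {\bar{J}}^{'} (u) = \partial J / \partial u at (u, \hat{v}), which equals \hat{v}. The sketch checks this numerically with a grid search over V; all names are hypothetical.

```python
import numpy as np

V_GRID = np.linspace(0.0, 1.0, 100001)   # discretization of the compact set V


def J(u, v):
    """Toy objective J(u, v) = u * v - v**2."""
    return u * v - v ** 2


def J_bar(u):
    """Upper envelope: max over v in V of J(u, v), approximated on the grid."""
    return np.max(J(u, V_GRID))


def danskin_derivative(u):
    """Partial derivative dJ/du evaluated at the (unique) maximizer v_hat."""
    v_hat = V_GRID[np.argmax(J(u, V_GRID))]
    return v_hat                          # dJ/du = v for this toy J


def finite_difference(u, eps=1e-4):
    """Central difference of the envelope J_bar."""
    return (J_bar(u + eps) - J_bar(u - eps)) / (2.0 * eps)


u = 1.2
print(danskin_derivative(u), finite_difference(u))
```

Both quantities should be close to u / 2, up to the grid and finite-difference error.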

References

1. Gerla, M.; Lee, E.K.; Pau, G.; Lee, U. Internet of vehicles: From intelligent grid to autonomous cars and vehicular clouds. In Proceedings of the IEEE World Forum on Internet of Things, Seoul, Korea, 6–8 March 2014; pp. 241–246.
2. Bahrepour, M.; Meratnia, N.; Poel, M.; Taghikhaki, Z.; Havinga, P.J. Distributed event detection in wireless sensor networks for disaster management. In Proceedings of the IEEE International Conference on Intelligent Networking and Collaborative Systems, Thessaloniki, Greece, 24–26 November 2010; pp. 507–512.
3. Rowaihy, H.; Eswaran, S.; Johnson, M.; Verma, D.; Bar-Noy, A.; Brown, T.; La Porta, T. A survey of sensor selection schemes in wireless sensor networks. In Proceedings of the Unattended Ground, Sea, and Air Sensor Technologies and Applications IX, Orlando, FL, USA, 11 May 2007; p. 65621A.
4. Guy, C. Wireless sensor networks. In Proceedings of the Sixth International Symposium on Instrumentation and Control Technology: Signal Analysis, Measurement Theory, Photo-Electronic Technology, and Artificial Intelligence, Beijing, China, 24 October 2006; p. 63571I.
5. Bhattacharya, D.; Krishnamoorthy, R. Power optimization in wireless sensor networks. Int. J. Comput. Sci. Issues 2011, 8, 415–419.
6. Weimer, J.E.; Sinopoli, B.; Krogh, B.H. A relaxation approach to dynamic sensor selection in large-scale wireless networks. In Proceedings of the IEEE International Conference on Distributed Computing Systems Workshops, Beijing, China, 17–20 June 2008; pp. 501–506.
7. Shen, X.; Varshney, P.K. Sensor selection based on generalized information gain for target tracking in large sensor networks. IEEE Trans. Signal Process. 2013, 62, 363–375.
8. Liu, L.; Wang, S.; Liu, D.; Zhang, Y.; Peng, Y. Entropy-based sensor selection for condition monitoring and prognostics of aircraft engine. Microelectron. Reliab. 2015, 55, 2092–2096.
9. Bajovic, D.; Sinopoli, B.; Xavier, J. Sensor selection for hypothesis testing in wireless sensor networks: A Kullback-Leibler based approach. In Proceedings of the 48th IEEE Conference on Decision and Control, Shanghai, China, 15–18 December 2009; pp. 1659–1664.
10. Bajovic, D.; Sinopoli, B.; Xavier, J. Robust linear dimensionality reduction for hypothesis testing with application to sensor selection. In Proceedings of the 47th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 30 September–2 October 2009; pp. 363–370.
11. Bajovic, D.; Sinopoli, B.; Xavier, J. Sensor selection for event detection in wireless sensor networks. IEEE Trans. Signal Process. 2011, 59, 4938–4953.
12. Chepuri, S.P.; Leus, G. Sparsity-promoting sensor selection for non-linear measurement models. IEEE Trans. Signal Process. 2015, 63, 684–698.
13. Bai, C.Z.; Katewa, V.; Gupta, V.; Huang, Y.F. A stochastic sensor selection scheme for sequential hypothesis testing with multiple sensors. IEEE Trans. Signal Process. 2015, 63, 3687–3699.
14. Wen, Z.; Yin, W. A feasible method for optimization with orthogonality constraints. Math. Program. 2013, 142, 397–434.
15. Bernhard, P.; Rapaport, A. On a theorem of Danskin with an application to a theorem of Von Neumann-Sion. Nonlinear Anal. 1995, 24, 1163–1181.
16. Scharf, L.L. Statistical Signal Processing: Detection, Estimation and Time Series Analysis; Addison-Wesley Publishing Company: New York, NY, USA, 1991.
17. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: New York, NY, USA, 1991.
18. Wu, F.C.; Wang, Z.H.; Hu, Z.Y. Cayley transformation and numerical stability of Calibration equation. Int. J. Comput. Vis. 2009, 82, 156–184.
19. Goldfarb, D.; Wen, Z.; Yin, W. A curvilinear search method for p-harmonic flows on spheres. SIAM J. Imaging Sci. 2009, 2, 84–109.
20. Barzilai, J.; Borwein, J.M. Two-point step size gradient methods. IMA J. Numer. Anal. 1988, 8, 141–148.
21. Zhang, H.C.; Hager, W.W. A nonmonotone line search technique and its application to unconstrained optimization. SIAM J. Optim. 2004, 14, 1043–1056.
22. Gao, B.; Liu, X.; Chen, X.; Yuan, Y. A new first-order framework for orthogonal constrained optimization problems. SIAM J. Optim. 2018, 28, 302–332.
23. Xia, D.X.; Shu, W.C.; Yan, S.Z.; Tong, Y.S. Second Course for Functional Analysis; Higher Education Press: Beijing, China, 1986.
24. Bonnans, J.F.; Shapiro, A. Perturbation Analysis of Optimization Problems; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
25. Ricceri, B.; Simons, S. Minimax Theory and Applications; Kluwer Academic Pub: Dordrecht, The Netherlands, 1998.
26. Grant, M.; Boyd, S.; Ye, Y. CVX Toolbox; Stanford University Press: Stanford, CA, USA, 2009.
27. Nemirovskii, A.; Nesterov, Y. Interior-Point Polynomial Algorithms in Convex Programming; SIAM: Philadelphia, PA, USA, 1994.
28. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
29. Ekeland, I.; Temam, R. Convex Analysis and Variational Problems; SIAM: Philadelphia, PA, USA, 1999.
30. Clarke, F.H. Nonsmooth analysis and optimization. In Proceedings of the International Congress of Mathematicians, Helsinki, Finland, 15–23 August 1978; pp. 847–853.

Figure 1. Working Flow of the OCPGA Based Method for Solving Problem (11).

Figure 2. ROC curve for the small-scale case.

Figure 3. ROC curve for the larger-scale case.

Table 1. Performance Comparison Proportions.

$(m, p)$	Better	Same	Worse
$(10, 3)$	20%	74%	6%
$(10, 4)$	24%	72%	4%
$(10, 5)$	20%	76%	4%
$(12, 3)$	18%	74%	8%
$(15, 3)$	22%	74%	4%
$(50, 5)$	20%	70%	10%
$(80, 5)$	15%	80%	5%
$(100, 5)$	10%	85%	5%

Table 2. Optimal Chernoff Distances of Three Methods.

m	$p = 3$			$p = 4$			$p = 5$
m	Exh	Gre	OCP	Exh	Gre	OCP	Exh	Gre	OCP
10	0.15	0.11	0.15	0.34	0.28	0.34	0.29	0.26	0.29
12	0.51	0.35	0.51	0.25	0.21	0.25	0.62	0.34	0.62
15	0.30	0.15	0.30	0.77	0.76	0.77	0.20	0.19	0.20

Exh: Exhaustive Method; Gre: Greedy Algorithm Based Method; OCP: OCPGA Based Method.

Table 3. Runtimes of Three Methods (s).

m	$p = 3$			$p = 4$			$p = 5$
m	Exh	Gre	OCP	Exh	Gre	OCP	Exh	Gre	OCP
10	$2.59 \times 10^{3}$	52.12	51.07	$1.10 \times 10^{4}$	204.77	105.01	$6.49 \times 10^{4}$	87.07	80.39
12	$4.17 \times 10^{3}$	128.75	109.35	$2.66 \times 10^{4}$	82.41	75.94	$2.04 \times 10^{5}$	106.53	98.31
15	$4.33 \times 10^{4}$	92.68	89.28	$1.50 \times 10^{5}$	193.45	188.95	$1.20 \times 10^{6}$	153.09	150.41

Exh: Exhaustive Method; Gre: Greedy Algorithm Based Method; OCP: OCPGA Based Method.

Table 4. Optimal Chernoff Distances of Two Methods.

m	$p = 5$		$p = 10$		$p = 15$
m	Gre	OCP	Gre	OCP	Gre	OCP
50	1.37	1.55	3.59	3.68	3.82	3.85
80	1.32	1.34	3.36	3.41	5.86	5.92
100	1.69	1.71	4.46	4.56	5.74	6.06

Gre: Greedy Algorithm Based Method; OCP: OCPGA Based Method.

Table 5. Runtimes of Two Methods (s).

m	$p = 5$		$p = 10$		$p = 15$
m	Gre	OCP	Gre	OCP	Gre	OCP
50	1499.76	866.96	2083.39	1956.96	3794.17	3635.68
80	1839.27	1718.73	5435.69	5251.96	5350.99	5240.56
100	2290.26	1752.71	5589.49	5273.85	11966.38	11433.33

Gre: Greedy Algorithm Based Method; OCP: OCPGA Based Method.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ma, T.; Qian, B.; Niu, D.; Song, E.; Shi, Q. A Gradient-Based Method for Robust Sensor Selection in Hypothesis Testing. Sensors 2020, 20, 697. https://doi.org/10.3390/s20030697
