1 Introduction

Modern engineering design is a complex and systematic task. Different disciplines are involved and various goals are set depending on the design stage. In the early stage of the design pipeline, a trade-off between conflicting design requirements must be found to generate satisfying solutions for the problem at hand. Without loss of generality, these requirements are presented as constraints. The designs satisfying such constraints are considered feasible and can be good candidates for detailed investigation. In practice, Autodesk [1] pointed out that, in the preliminary design stage, designers use design optimization (e.g., [2, 3]) as a heuristic design tool. These algorithms were originally proposed to push the performance of a design problem to its extreme as a finalizing step in the design process [4], but optimization is now being misused to explore the space of feasible designs at the beginning of the design cycle. This interesting phenomenon stems from the need for a generative design process in which designers are presented with several interesting design candidates.

Nevertheless, the sole emphasis on optimality by optimization algorithms may not be the best choice at this stage, and there is a need for techniques that identify all the suitable designs. Several arguments support this claim. First, the optimal design configuration may only serve as the start of the design process [4]; being too myopic at a limited number of optima may ignore some interesting design candidates. Second, especially at the early design stage, the objectives of the system that need to be optimized may not yet be clearly defined [5] and designers need a strategy to gain understanding of the design space. In this case, unconventional but feasible designs may also be valuable and worth further investigation. Additionally, the obtained result may not transfer well to reality [6] (a phenomenon called the reality gap [7]). In these cases, a large collection of suitable design configurations can quickly provide a working solution.

This alternative interpretation can be formulated as a feasible region identification problem, where one seeks to accurately predict whether a design configuration is feasible or not:

$$\begin{aligned} \begin{aligned}&\text {Find}\ A := \{{\varvec{x}} \in {\mathcal {X}} : I_A(f({\varvec{x}}))=1\}\\&\text {where}\ I_A(f({\varvec{x}}))= {\left\{ \begin{array}{ll} 1&{} \alpha \le f({\varvec{x}}) \le \beta \\ 0&{} \text {otherwise} \end{array}\right. } \end{aligned} \end{aligned}$$
(1)

\(f: {\mathcal {X}}\rightarrow {\mathbb {R}}\) is a noise-free, continuous, black box constraint function defined on a compact set \({\mathcal {X}} \subset {\mathbb {R}}^d\). \(I_A(\cdot )\) is an indicator function defined on an interval of the output space explicitly specified by the designer, which denotes whether the design requirements are satisfied. In the bounded space \({\mathcal {X}}\), a design \({\varvec{x}}\) is called feasible if its corresponding constraint function value \(f({\varvec{x}})\) lies in the interval \([\alpha , \beta ]\). A region is called a feasible region if all designs within it are feasible. Specifically, we denote \(\alpha , \beta \in {\mathbb {R}} \cup \{-\infty , \infty \}\) as the feasible boundary, while \(\{{\varvec{x}} \in {\mathcal {X}} : f({\varvec{x}}) \in \{\alpha , \beta \} \}\) is denoted as the feasible region boundary. In general, more than one constraint f can be applied in a design problem, in which case the feasible region is the intersection of the regions corresponding to each individual f.
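
As a minimal illustration of Eq. (1), the indicator below checks whether a design satisfies a hypothetical constraint; the constraint function and bounds are placeholders, not taken from the applications of Sect. 4:

```python
import numpy as np

def is_feasible(f_value, alpha=-np.inf, beta=1.0):
    # Indicator I_A of Eq. (1): 1 if alpha <= f(x) <= beta, 0 otherwise.
    return int(alpha <= f_value <= beta)

# Hypothetical constraint for illustration: the squared norm of the design vector.
f = lambda x: float(np.sum(np.asarray(x) ** 2))

print(is_feasible(f([0.3, 0.4])))  # 1: 0.25 lies below beta = 1.0
print(is_feasible(f([1.2, 0.9])))  # 0: 2.25 exceeds beta = 1.0
```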

In this framework, computer-aided design (CAD) simulations are often involved to evaluate whether the constraints are satisfied, leading to feasible design configurations. However, given the complexity of modern engineering systems, CAD simulations can be both time- and resource-intensive. For instance, an electromagnetic simulator or a computational fluid dynamics solver can take hours [8] or even days [9] to perform a single evaluation of f. Hence, it is of paramount importance to limit the number of (expensive) evaluations during the design phase, while also being able to provide a large number of feasible designs with high accuracy. In this scenario, active learning [10, 11] (or adaptive sampling [12,13,14]) is an elegant tool to solve this problem. Active learning (AL) is also known as sequential experimental design in statistics [15]: it sequentially extends the data set to explore the design space in a sample-efficient way, in order to reduce uncertainty with respect to certain properties. Such desired properties can be the location of a function's maximum, as in Bayesian optimization (or surrogate-based optimization) [16, 17], the overall accuracy of a regression model [18], or a feasible region satisfying all constraints. The last case is considered in our study.

In this paper, we introduce an information theoretic-based active learning framework to make accurate feasibility predictions within a bounded design space. To reach an efficient termination of the active learning process, a new framework of termination criterion, which we alternatively refer to as stopping criterion, is proposed to provide a probabilistic belief of the metamodel’s prediction accuracy. We theoretically investigate the relation between the proposed stopping criterion and the test accuracy metrics commonly used for binary classification problems. The performance of our novel method is validated through comparisons with state-of-the-art methods on synthetic experiments. Additionally, two real-world design problems, including one involving multiple design constraints, are also presented.

This paper is structured as follows. First, the background and related techniques are described in Sect. 2. The novel active learning framework, including the active learning method and the framework of the stopping criteria, is introduced in Sect. 3. The numerical experiments and real-world applications are presented in Sect. 4. Conclusions are summed up in Sect. 5.

2 Background and related work

Fig. 1 Generic flowchart of active learning for feasible region identification

The generic active learning framework for feasible region identification problems is shown in Fig. 1. Given the initial samples, the corresponding function values are evaluated by the expensive black box function. Subsequently, a metamodel is fitted to the data. Next, a new sample of the design parameters is selected according to some specified criterion built upon the model (infill criterion or acquisition function). The key characteristics of active learning are twofold: the model provides a description of the feasible region, and the model prediction can be used to formulate an acquisition function, which provides a measure of how informative an unevaluated sample can be. The acquisition function is then used to choose the next sample to evaluate. The model is updated accordingly to provide a new (and more accurate) description of the feasible region. This process is repeated until a stopping criterion is met. Finally, the feasible region is accurately described by the metamodel.

2.1 Metamodels for characterizing feasibility

A metamodel, or surrogate model, is a numerical approximation of the black box constraint function. Making feasibility predictions with metamodels is a common strategy. This problem is closely related to reliability analysis [19], where the computational complexity mainly stems from the statistical estimation of f in Eq. (1). In other contexts, this problem is also widely studied [20] and is referred to as excursion set [21, 22] and level set [23, 24] estimation, where one wants to predict whether an unevaluated observation’s outcome exceeds a specified level. From Eq. (1), a unified view can be given by regarding level set estimation, excursion search and feasible region identification as the same problem.

To make accurate predictions, several approaches have been proposed in the literature [5, 13, 25, 26]. These methodologies are regression-based, rather than classification-based, since f is a real-valued continuous function in many relevant engineering applications. We refer to [11, 27, 28] for classification-based methods in case f is discrete-valued.

2.2 Active feasible region identification

Once the metamodel has been constructed, active learning will improve the confidence in predicting the feasible region. Several techniques have been presented in the literature to actively identify the feasible region. An early one actively selects queries to learn the boundaries that divide a function domain into regions where the function is above or below a given threshold [26]. A heuristic acquisition function has also been proposed to encourage sampling near the feasible region boundary, in locations that simultaneously exhibit large uncertainty. A weighted integrated mean square error (\(\text {IMSE}_{T}\)) strategy is proposed by Picheny [25] to measure the belief of the metamodel about the accuracy of the feasible region. A new sampling technique is then developed, based on the fact that the model variance is not affected by the function response of new data. Several stepwise uncertainty reduction (SUR) strategies for the excursion set estimation problem have also been developed in [20], which employ sequential sampling to minimize the estimation error or uncertainty, measured by Bayesian risk. The consistency properties of some SUR methods have also been established recently [29]. A variant of the Probability of Feasibility (PoF) [30] has been proposed in [31], where a prediction variance term has been heuristically added to encourage exploration within the design space. Since most of these works assume the presence of the feasible region within a bounded design space, an active expansion sampling technique is introduced in [11] for unbounded design space problems.

2.3 Stopping criterion

Considering the active learning framework in Fig. 1, it is fundamental to define a metric to stop the iterative refinement of the metamodel.

To define such a stopping criterion, it is necessary to assess when the feasible region is accurately characterized by the metamodel, i.e., whether the metamodel can accurately predict if a design configuration is feasible or not. As different performance metrics are used for regression and classification problems, first the proper problem type must be determined. The goal is to predict whether a data point is within the feasible region: \(I_A(f(\cdot ))\rightarrow \{0, 1\}\). This can be formulated as a binary classification problem, and can thus be evaluated by classification metrics such as the F1 score:

$$\begin{aligned} \text {F}_1 = 2 \frac{\text {precision} \cdot \text {recall}}{\text {precision} + \text {recall}} \end{aligned}$$
(2)

where precision and recall are defined as:

$$\begin{aligned}&\text {precision} = \frac{\text {true positives}}{\text {true positives} + \text {false positives}} \end{aligned}$$
(3)
$$\begin{aligned}&\text {recall} = \frac{\text {true positives}}{\text {true positives} + \text {false negatives}}. \end{aligned}$$
(4)

Either precision, recall or F1 score can be used as a performance measurement, each having its own merits. Precision (Eq. 3) describes to what extent the predicted feasible region actually corresponds to a region satisfying the feasibility criteria; nevertheless, it gives no indication of whether all the feasible regions in the design space have been captured by the model. On the other hand, the recall metric (Eq. 4) emphasizes to what extent all the feasible regions have been captured by the model, though it provides no indication when the model incorrectly predicts infeasible regions as feasible ones. The F1 score, defined as the harmonic mean of precision and recall, provides a good balance between these considerations. This metric has been widely used in several applications [5, 11, 32] and is particularly suitable in our scenario.
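
For reference, a direct computation of Eqs. (2)–(4) from confusion counts is straightforward; the counts used in the example are arbitrary:

```python
def precision_recall_f1(tp, fp, fn):
    # Precision, recall and F1 score from confusion counts (Eqs. 2-4).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example with arbitrary counts: 90 true positives, 10 false positives, 5 false negatives.
print(precision_recall_f1(90, 10, 5))  # (0.9, 0.947..., 0.923...)
```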

Unfortunately, the direct application of the F1 score (as well as the precision and recall) is costly because of the requirement of suitable test data. Thus, common AL methods [5, 11, 19] use a maximum number of function evaluations as the stopping criterion. However, it is not possible to estimate upfront how many function evaluations are necessary to obtain an accurate estimate of the feasible region: adopting a predefined maximum number of function evaluations can lead to an inaccurate estimation of the feasible region or to too many (unnecessary) expensive evaluations of the function f. To circumvent this problem, a common approach is to determine the stopping criterion purely based on the metamodel. Perhaps the most relevant criterion is the IMSE\(_T\) proposed in [25], in which the Kriging prediction variance is weighted by a probability specifying how likely a point is in the feasible region, and then averaged over the design space. The IMSE\(_T\) metric presents an elegant way to measure the model’s belief about its accuracy within the feasible region. However, the use of this unscaled metric in practice still depends on expert knowledge to judge whether the averaged variance is tolerable or not.

3 Active feasible region identification with stopping criterion

The proposed novel AL framework is presented in this section. We first assume that the constraint functions are continuous-valued as in [5, 33] and can be described via Gaussian process regression (Sect. 3.1). Then, the acquisition function is presented in Sect. 3.2 and the new stopping criterion is proposed in Sect. 3.3.

Fig. 2 Flowchart of the active learning framework: the Gaussian process is utilized to provide a probabilistic description of the feasible region. The data is sequentially acquired by the information-theoretic based acquisition function. The active learning process terminates once the metamodel is 95% confident that its prediction accuracy is larger than a predefined acceptable F1 score C

The main workflow of our active learning consists of the following steps:

  1. Use design of experiments (DOE) techniques to select an initial set of design samples to evaluate.

  2. In each subsequent iteration:

     (a) Build a GP regression model on the evaluated samples (Sect. 3.1).

     (b) Check if the stopping criterion has been satisfied (Sect. 3.3); if yes, proceed to step 3, otherwise continue.

     (c) Select a new sample to be evaluated by maximizing the acquisition function (Sect. 3.2).

     (d) Evaluate the new sample, augment the existing samples and go back to step 2(a).

  3. Exit the AL process and return the final metamodel for prediction.

The proposed framework is illustrated in Fig. 2.
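
A minimal sketch of this loop is given below. The component interfaces (evaluate_simulator, fit_gp, maximize_acquisition, stopping_criterion_met) are hypothetical placeholders for the building blocks detailed in Sects. 3.1–3.3, not part of any particular library:

```python
import numpy as np

def active_learning(evaluate_simulator, fit_gp, maximize_acquisition,
                    stopping_criterion_met, X_init, max_iter=200):
    # Generic active learning loop of Fig. 2 (hypothetical component interfaces).
    X = np.atleast_2d(X_init)
    y = np.array([evaluate_simulator(x) for x in X])
    gp = fit_gp(X, y)
    for _ in range(max_iter):
        if stopping_criterion_met(gp):          # step 2(b): F1-sample based check
            break
        x_new = maximize_acquisition(gp)        # step 2(c): next most informative design
        y_new = evaluate_simulator(x_new)       # step 2(d): expensive simulation
        X = np.vstack([X, x_new])
        y = np.append(y, y_new)
        gp = fit_gp(X, y)                       # step 2(a): refit the metamodel
    return gp                                   # step 3: final metamodel for prediction
```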

3.1 Gaussian process regression

There are many regression models that can be used to cheaply approximate the simulator, such as Bayesian neural networks, support vector regression, and random forest regression. Nevertheless, under the category of data-efficient machine learning, Gaussian process (GP), also known as Kriging [34, 35], is arguably the most prevalent probabilistic technique. GPs have been widely used as a regression method in many different areas including electronics engineering [31], computational fluid dynamics [36] and chemistry [37].

A GP can be considered as a distribution over functions \(f: {\mathcal {X}} \rightarrow {\mathbb {R}}\) that is completely defined by a suitable mean function \(m: {\mathcal {X}} \rightarrow {\mathbb {R}}\) and covariance function \(k: {\mathcal {X}} \times {\mathcal {X}} \rightarrow {\mathbb {R}}\). Starting from the GP prior and the training data \(D_n\), it is possible to derive a posterior belief over the function values at (possibly multiple) test points \({\varvec{X}}_*\). This belief is Gaussian distributed:

$$\begin{aligned} {\varvec{Y}}_* \vert D_n, {\varvec{X}}_* \sim {\mathcal {N}}(K_{n*}^\mathrm{{T}}(K_{nn} + \sigma ^2 I)^{-1} {\varvec{y}}, K_{**} - K_{n*}^\mathrm{{T}}(K_{nn} + \sigma ^2 I)^{-1} K_{n*}) \end{aligned}$$
(5)

where \({\varvec{X}}_*\) denotes the (possibly multiple) test points, \(K_{nn}\) is the kernel matrix between the training samples \(D_n\), \(K_{n*}\) is the kernel matrix between the training samples and the test points, \(K_{**}\) is the kernel matrix between the test points, and \(\sigma ^2\) is the variance of the Gaussian likelihood, which usually represents the noise level of the training samples. Note that Eq. (5) also reveals that sampling a path of a Gaussian process at a discrete set of points is equivalent to sampling from a known multivariate Gaussian distribution; this property will be heavily utilized for the proposed stopping criterion. The interested reader is referred to [38] for a detailed description of GPs.
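
For concreteness, a minimal NumPy sketch of the posterior computation in Eq. (5) is shown below, assuming a squared exponential kernel with fixed (unit) hyperparameters; in our experiments this is handled internally by the GP library, so the snippet is purely illustrative:

```python
import numpy as np

def se_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared exponential kernel matrix between the rows of A and B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-6):
    # Posterior mean and covariance of Eq. (5) at the test points.
    K_nn = se_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_ns = se_kernel(X_train, X_test)
    K_ss = se_kernel(X_test, X_test)
    L = np.linalg.cholesky(K_nn)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_ns)
    return K_ns.T @ alpha, K_ss - v.T @ v
```

Drawing a posterior path at the test points then amounts to sampling from a multivariate Gaussian with this mean and covariance, which is exactly the property exploited by the F1-sample stopping criterion of Sect. 3.3.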

Once the GP metamodel has been built, it can be utilized to make feasibility predictions. Since the GP posterior prediction is a Gaussian distribution, we use its mean to make the feasibility predictions, i.e., the feasible region is modeled by:

$$\begin{aligned} A_\mu := \{{\varvec{x}} \in {\mathcal {X}} : I_A(\mu ({\varvec{x}}))=1\} \end{aligned}$$
(6)

Note that the use of the GP posterior mean as the feasibility predictor also embeds our assumption that, in case there is a misclassification of a design configuration’s feasibility, the cost of either wrongly predicting it as feasible (false positive) or wrongly predicting it as infeasible (false negative) is equivalent from a decision theoretic perspective [38]. We refer to [21] in case these two costs are not equivalent.

3.2 Acquisition function based on uncertainty reduction

The information-theoretic based acquisition function presented in [5, 39, 40] is adopted. The main intuition behind it is to sample the point that maximizes the expected information gain about an interval event g, given a new data point \({\varvec{x}}\). This is quantified by the reduction of differential entropy:

$$\begin{aligned} \alpha _g({\varvec{x}}) = {\mathbb {H}}(p(g \vert D_n)) - {\mathbb {E}}_{p(f_{{\varvec{x}}}\vert D_n, {\varvec{x}})}\left[ {\mathbb {H}}(p(g \vert D_n \cup \{{\varvec{x}}, f_{{\varvec{x}}}\}))\right] \end{aligned}$$
(7)

where \(p(f_{{\varvec{x}}}\vert D_n, {\varvec{x}})\) denotes the predictive response distribution at \({\varvec{x}}\), which in case of a GP is a Gaussian distribution, \({\mathbb {E}}(\cdot )\) denotes the expectation over the possible outcomes of \(f_{{\varvec{x}}}\), while \({\mathbb {H}}(\cdot )\) denotes the differential entropy, which can be used as an aggregated representation of a probability distribution’s uncertainty.

However, the calculation of Eq. (7) is intractable, since the GP posterior must be updated to calculate \(p(g \vert D_n \cup \{{\varvec{x}}, f_{{\varvec{x}}}\})\) for a certain location \({\varvec{x}}\). To alleviate the computational burden, a common alternative representation [41] arises by noting that Eq. (7) actually represents the conditional mutual information \(\text {I}\{g, f_{{\varvec{x}}} \vert D_n, {\varvec{x}}\}\). Hence, Eq. (7) can be reformulated in its symmetric form:

$$\begin{aligned} \alpha _g({\varvec{x}}) = {\mathbb {H}}(p(f_{{\varvec{x}}} \vert D_n,{\varvec{x}})) - {\mathbb {E}}_{p(g \vert D_n )}\left[ {\mathbb {H}}(p(f_{{\varvec{x}}} \vert D_n, {\varvec{x}}, g))\right] \end{aligned}$$
(8)

Let us suppose that the feasible boundary is \([\alpha , \beta ]\); the acquisition function is then formulated to maximize the expected information gain about the region of interest and its complement:

$$\begin{aligned} \begin{aligned} \alpha ({\varvec{x}})&= \sum _{i=1}^{3}\alpha _{g_i}({\varvec{x}}) \\&= 3{\mathbb {H}}(p(f_{{\varvec{x}}} \vert D_n, {\varvec{x}})) - {\mathbb {H}}(p(f_{{\varvec{x}}} \vert D_n, {\varvec{x}}, f_{{\varvec{x}}}<\alpha )) \\&- {\mathbb {H}}(p(f_{{\varvec{x}}}\vert D_n, {\varvec{x}}, \alpha<f_{{\varvec{x}}}<\beta )) \\&- {\mathbb {H}}(p(f_{{\varvec{x}}}\vert D_n, {\varvec{x}}, \beta < f_{{\varvec{x}}})) \end{aligned} \end{aligned}$$
(9)

with intervals defined as:

$$\begin{aligned} g_1=\{-\infty<f_{{\varvec{x}}}< \alpha \},\quad g_2=\{\alpha<f_{{\varvec{x}}}< \beta \},\quad g_3=\{\beta<f_{{\varvec{x}}} < \infty \} \end{aligned}$$
(10)

Now, computing Eq. (9) only involves calculating the differential entropy of a truncated Gaussian distribution. In particular, the following closed form expression holds [5]:

$$\begin{aligned} \begin{aligned} \alpha _{g_2}({\varvec{x}}) =&{\mathbb {H}}(p(f_{{\varvec{x}}} \vert D_n, {\varvec{x}})) - {\mathbb {H}}(p(f_{{\varvec{x}}}\vert D_n, {\varvec{x}}, \alpha<f_{{\varvec{x}}}<\beta )) \\ =&- \text {log}\;(Z) \\&-\frac{1}{2Z} \left[ (\alpha - \mu ({\varvec{x}})) {\mathcal {N}}(\mu ({\varvec{x}}) \vert \alpha , \sigma ^2) - (\beta - \mu ({\varvec{x}})){\mathcal {N}}(\mu ({\varvec{x}}) \vert \beta , \sigma ^2) \right] \end{aligned} \end{aligned}$$
(11)

where \(Z=\varPhi \left[ \frac{\beta - \mu ({\varvec{x}})}{\sigma ({\varvec{x}})}\right] - \varPhi \left[ \frac{\alpha - \mu ({\varvec{x}})}{\sigma ({\varvec{x}})}\right]\), and \(\varPhi\) denotes the cumulative distribution function of the standard normal distribution.
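As a minimal sketch, the closed-form term \(\alpha _{g_2}\) of Eq. (11) can be evaluated with SciPy as follows; the terms for \(g_1\) and \(g_3\) follow from the same truncated-Gaussian entropy expression applied to the corresponding one-sided intervals:

```python
import numpy as np
from scipy.stats import norm

def info_gain_interval(mu, sigma, alpha, beta):
    # Expected information gain of Eq. (11) for the interval [alpha, beta],
    # given the GP posterior mean mu and standard deviation sigma at x.
    a = (alpha - mu) / sigma
    b = (beta - mu) / sigma
    Z = norm.cdf(b) - norm.cdf(a)
    if Z <= 0.0:
        return 0.0  # the interval carries (numerically) no probability mass
    return -np.log(Z) - (a * norm.pdf(a) - b * norm.pdf(b)) / (2.0 * Z)
```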

One advantage of this information-theoretic based acquisition function is that it can be conveniently extended to handle multiple-constraint problems. Since the derivation of this case is omitted in [5], we briefly present it here. Consider the following multi-constraint problem \({\varvec{x}}\rightarrow {\varvec{f}}\), where the desired feasible region must satisfy all k black box constraints:

$$\begin{aligned} I_A({\varvec{f}}({\varvec{x}}))= {\left\{ \begin{array}{ll} 1&{} \alpha _1 \le f_1({\varvec{x}}) \le \beta _1, \alpha _2 \le f_2({\varvec{x}}) \le \beta _2,\ldots ,\alpha _k \le f_k({\varvec{x}}) \le \beta _k \\ 0&{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(12)

By making use of the mutual information \(I({\varvec{g}}; {\varvec{f}} \vert {\varvec{x}}, \varvec{D_n})\), the acquisition function can be formulated as:

$$\begin{aligned} \begin{aligned} \alpha _{{\varvec{g}}}({\varvec{x}})&= {\mathbb {H}}(p({\varvec{f}}_{{\varvec{x}}} \vert \varvec{D_n}, {\varvec{x}})) - {\mathbb {E}}_{p({\varvec{g}}\vert \varvec{D_n}, {\varvec{x}})}\left[ {\mathbb {H}}(p({\varvec{f}}_{{\varvec{x}}} \vert \varvec{D_n} \cup \{{\varvec{x}}, {\varvec{g}}\}))\right] \\&= \frac{1}{2} \text {ln det} (2\pi e\varSigma ({\varvec{x}})) - {\mathbb {H}}(p({\varvec{f}}_{{\varvec{x}}} \vert \varvec{D_n} \cup \{{\varvec{x}}, {\varvec{g}}\}))\\ \end{aligned} \end{aligned}$$
(13)

\(\varSigma ({\varvec{x}})\) is the covariance matrix between the one-dimensional components of \({\varvec{f}}\). Assuming independence between the different components of \({\varvec{f}}\), the multi-constraint counterpart of Eq. (11) can be written as:

$$\begin{aligned} \begin{aligned} \alpha _{{{\varvec{g}}}_{{\varvec{2}}}}({\varvec{x}}) =&\frac{1}{2} {\text {ln}} (2\pi e)^k\ \text {det} (\varSigma ({\varvec{x}})) - \sum _{j=1}^{k}{\mathbb {H}}(p(f_j \vert D_n \cup \{{\varvec{x}}, g_{j_2}\})) \quad \\ =&\sum _{j=1}^k \frac{1}{2} {\text {ln}} (2\pi e)\sigma _{j}^2 - \sum _{j=1}^{k}\left[ \frac{1}{2} {\text {ln}} (2\pi e \sigma _{j}^2Z_{j}^2) + \right. \\&\left. \frac{1}{2Z_j}[(\alpha _j - \mu _j){\mathcal {N}}(\mu _j \vert \alpha _j, \sigma _{j}^2) - (\beta _j - \mu _j){\mathcal {N}}(\mu _j \vert \beta _j, \sigma _{j}^2 )]\right] \\ =&\sum _{j=1}^k \alpha _{g_{j_2}}({\varvec{x}}) \end{aligned} \end{aligned}$$
(14)

Note that, for simplicity, the symbol \({\varvec{x}}\) is sometimes omitted in Eq. (14).

Once the acquisition function is defined, it can be used in the learning loop (see Fig. 2). Indeed, at each iteration, the GP metamodel is refined with a new expensive simulation, performed for the maximizer \(\bar{{\varvec{x}}}\) of the acquisition function. To estimate \(\bar{{\varvec{x}}}\), the following procedure is adopted: the acquisition function is evaluated at 1000 candidate points chosen via Monte Carlo sampling, and the best candidate point is then used as the starting point of an L-BFGS-B gradient-based optimizer [42]. This optimization process can be performed very efficiently, since the acquisition function \(\alpha ({\varvec{x}})\) depends only on the GP model, which is very cheap to evaluate.
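
A possible implementation of this optimization step is sketched below with SciPy; `acq` stands for any scalar acquisition function built on the GP, and the uniform candidate generation is an assumption (the procedure above only states that candidates are chosen via Monte Carlo sampling):

```python
import numpy as np
from scipy.optimize import minimize

def maximize_acquisition(acq, bounds, n_candidates=1000, seed=None):
    # Monte Carlo pre-screening followed by L-BFGS-B refinement.
    # acq maps a (d,) design vector to a scalar; bounds is a list of (low, high) pairs.
    rng = np.random.default_rng(seed)
    lows, highs = np.array(bounds).T
    candidates = rng.uniform(lows, highs, size=(n_candidates, len(bounds)))
    values = np.array([acq(x) for x in candidates])
    x0 = candidates[np.argmax(values)]                    # best random candidate
    res = minimize(lambda x: -acq(x), x0, bounds=bounds, method="L-BFGS-B")
    return res.x
```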

Let us remark that this information theoretic-based acquisition function has demonstrated competitive performance compared with existing active learning techniques for feasible region identification. The interested reader is referred to [5] for a detailed discussion on this topic.

3.3 Framework of metamodel-based stopping criterion

An accurate estimation of the precision, recall and F1 score requires evaluating f for a large set of samples in the design space. Since evaluating f via simulations is expensive, it is not possible to directly use these metrics as the stopping criterion of the proposed active learning strategy. In the following, it is first described how these metrics can be efficiently estimated from a Bayesian perspective, i.e., from the metamodel’s belief. Then, these estimations can be adopted as stopping criteria of the AL process (see Fig. 2). For notational simplicity, we start with the description of the stopping criterion in case of a single constraint function (k = 1), and then generalize it to multiple constraints.

For a problem defined by a single constraint \(f: {\mathbb {R}}^{m \times d} \rightarrow {\mathbb {R}}^{m \times 1}\), the corresponding feasible region is \(A_f:=\{{\varvec{x}} \in {\mathcal {X}}: \alpha \le f({\varvec{x}}) \le \beta \}\), and the feasible region predicted by the GP metamodel (with posterior mean \(\mu : {\mathbb {R}}^{m \times d} \rightarrow {\mathbb {R}}^{m \times 1}\)) is \(A_\mu :=\{{\varvec{x}} \in {\mathcal {X}}: \alpha \le \mu ({\varvec{x}})\le \beta \}\). Furthermore, let us indicate with \(\text {precision}_\mu \vert f, \alpha , \beta , X^*\) the precision of the metamodel identifying the feasible region \(A_\mu\), compared with the actual feasible region \(A_f\) based on test data \(X^*\). Let us use \(\text {recall}_\mu \vert f, \alpha , \beta , X^*\) and \(\text {F1}_\mu \vert f, \alpha , \beta , X^*\) to indicate the recall and F1 score estimated by the metamodel, respectively. These three performance measurements depend on the given feasible boundary \([\alpha , \beta ]\) and on the test data \(X^*: \{X_F^*, X_{IF}^*\} \in {\mathbb {R}}^{n \times d}\), where \(X_F^*\) and \(X_{IF}^*\) denote the data predicted as feasible and infeasible by the posterior mean \(\mu\), respectively, n is the size of the test data (m denotes a generic number of evaluation points), and d represents the problem dimensionality.

Let us further define a set operator \(I_F [C] = \{c \in C \vert \alpha<c<\beta \}\), where \(C \subset {\mathbb {R}}^{m \times 1}\), which returns the subset of elements satisfying the constraint. Let us define a similar operator for the metamodel, \(I_{A_\mu }(D)= D\cap A_\mu\), which returns the subset of elements of the input data \(D \in {\mathbb {R}}^{m \times d}\) that lie in the predicted feasible region.

With slight abuse of the notations given above, the precision, recall and F1 score can then be represented as:

$$\begin{aligned} \text {precision}_\mu \vert f, \alpha , \beta , X^*&= \frac{\vert I_{F}[f[I_{A_\mu }(X^*)]] \vert }{\vert I_F [\mu [X^*]] \vert } = \frac{\vert I_{F}[f[X_F^*]]\vert }{\vert X_F^*\vert } \end{aligned}$$
(15)
$$\begin{aligned} \text {recall}_\mu \vert f, \alpha , \beta , X^*&= \frac{\vert I_{F}[f[I_{A_\mu }(X^*)]] \vert }{\vert I_{F}[f[I_{A_\mu }(X^*)]] \vert + \vert I_{F}[f[I_{{\mathcal {X}}\backslash A_\mu }(X^*)]] \vert } = \frac{\vert I_{F}[f[X_F^*]]\vert }{ \vert I_{F}[f[X^*]]\vert } \end{aligned}$$
(16)
$$\begin{aligned} \text {F1}_\mu \vert f, \alpha , \beta , X^*&= 2 \frac{{\text {precision}}_\mu \cdot {\text {recall}}_\mu }{{\text {precision}}_\mu + {\text {recall}}_\mu } = 2\frac{\vert I_{F}[f[X_F^*]] \vert }{\vert I_{F}[f[X^*]] \vert + \vert X_F^* \vert } \end{aligned}$$
(17)

where \(\vert . \vert\) denotes the number of rows, for instance, \(\vert X^* \vert = n\).

These three performance metrics are difficult to compute, as a large number of samples needs to be evaluated by the expensive simulator as test data. To efficiently make use of these metrics, a Bayesian approach can be utilized to describe the expensive constraint f in Eqs. (15)–(17) with the belief of the GP metamodel: f is treated as a sample drawn from the GP posterior. In this framework, Eqs. (15)–(17) become random variables.

First, we will present an analytical analysis of the precision metric under our setting. Specifically, we provide an unbiased analytical estimation of its mean and an analytical approximation of its variance.

Proposition 1

Given the assumption that the black box function is a random sample path of a known GP posterior, \(f \sim {{\mathcal {GP}}}\), we have:

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_f(\text {precision}_\mu \vert f, \alpha , \beta , X^*) \\&= \frac{1}{N_F}\sum _{i=1}^{N_F} \left[ \varPhi \left( \frac{\beta -\mu ({\varvec{x}}_{F_i}^*)}{\sigma ({\varvec{x}}_{F_i}^*)}\right) -\varPhi \left( \frac{\alpha -\mu ({\varvec{x}}_{F_i}^*)}{\sigma ({\varvec{x}}_{F_i}^*)}\right) \right] \\&= {\mathbb {E}}_f \left[ \frac{1}{N_F}\sum _{i=1}^{N_F} I_{F}[f({\varvec{x}}_{F_i})]\right] \approx {\mathbb {E}}_f\left[ V(A_{f \subseteq \mu })/V(A_\mu )\right] \end{aligned} \end{aligned}$$
(18)

where \(N_F\) is the size of the predicted feasible data set \(X_F^*\), i.e., \(\vert X_F^* \vert\), \({\varvec{x}}_{F_i}^*\) denotes a specific point i in \(X_F^*\), \(V(\cdot )\) denotes the volume of a region, and \(A_{f \subseteq \mu }\) denotes the truly feasible region contained within the predicted feasible space \(A_\mu\).

A generalized proof of Proposition 1 (which is applicable in case of multiple constraints) is given in Appendix (Sect. 1.1). Proposition 1 highlights three properties of the Bayesian treatment of the precision metric. First and foremost, when the black box function f is described by the GP posterior, the precision metric obtained by comparing the posterior mean prediction with f is a random variable, and the second line of Eq. (18) provides an unbiased analytical expression of its expectation. Second, the last line of Eq. (18) reveals the connection of the precision metric with the excursion search method [20], which aims to provide an accurate estimation of the probability of failure within a given design space \({\mathcal {X}}\). This brings an intuitive geometric interpretation of the equation: as the amount of test data increases, the analytical expression converges to the expected feasible volume ratio within the model-predicted feasible space \(A_\mu\). Third, building on the second property, the last line of Eq. (18) also justifies applying the metamodel to arbitrary test data: Eq. (18) remains a meaningful reference even though it is originally conditioned on (possibly different) test data \({\varvec{X}}^*\), since as long as both data sets are densely distributed in the design space, applying Eq. (18) to different test data sets approximates the same geometric quantity.

We now demonstrate an approximation of the variance of the precision metric, a generalized proof of which is given in Appendix (Sect. 1.2).

Proposition 2

Under the same assumptions and notation of Proposition 1, the variance of the precision metric can be approximated as:

$$\begin{aligned} \begin{aligned}&{\mathbb {V}}_f(\text {precision}_\mu \vert f, \alpha , \beta , X^*) \approx \frac{1}{N_{F}^2}\sum _{i=1}^{N_F} \left[ W(\varvec{{\varvec{x}}_{F_i}^*})- W(\varvec{{\varvec{x}}_{F_i}^*})^2\right] \end{aligned} \end{aligned}$$
(19)

where \(W({\varvec{x}}_{F_i}^*):=\left[ \varPhi \left( \frac{\beta -\mu ({\varvec{x}}_{F_i}^*)}{\sigma ({\varvec{x}}_{F_i}^*)}\right) -\varPhi \left( \frac{\alpha -\mu ({\varvec{x}}_{F_i}^*)}{\sigma ({\varvec{x}}_{F_i}^*)}\right) \right]\); the approximation becomes an equality when the GP posterior predictions are conditionally independent, i.e., when \(\mathrm{{cov}}({\varvec{Y}}_* \vert D_n, {\varvec{X}}_*)\) is diagonal.
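
A minimal sketch of Eqs. (18) and (19) is given below, assuming arrays of posterior means and standard deviations at the test points predicted as feasible:

```python
import numpy as np
from scipy.stats import norm

def precision_mean_var(mu_F, sigma_F, alpha, beta):
    # Analytical mean (Eq. 18) and approximate variance (Eq. 19) of the precision,
    # given GP posterior means/stds at the N_F test points predicted as feasible.
    W = norm.cdf((beta - mu_F) / sigma_F) - norm.cdf((alpha - mu_F) / sigma_F)
    return W.mean(), np.sum(W - W**2) / len(W)**2
```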

The benefit of using a Bayesian perspective to describe the precision metric (as well as the recall and F1 score) is twofold. First, the Bayesian description of f by the GP posterior circumvents the use of precalculated expensive test data. Second, the reformulation of the performance metric as a random variable also allows the calculation of its variance, which is of practical interest since confidence bounds can be extracted from it.

Unfortunately, similar analytical forms do not exist for the recall and F1 score metric. Hence, to make use of these metrics as stopping criteria, following the same Bayesian treatment, one can sample the possible f from the GP posterior to calculate the possible F1 scores, as shown in Algorithm 1. This new algorithm, referred to as F1-sample, is used in this paper to describe to what extent the feasible region has been modeled.

Algorithm 1 F1-sample (pseudocode not reproduced)
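
Since the pseudocode figure is not reproduced here, a minimal sketch of the F1-sample procedure described above is given instead; it assumes the GP posterior mean and covariance of Eq. (5) at the test points and a single constraint with bounds \([\alpha , \beta ]\):

```python
import numpy as np

def f1_sample(mean, cov, alpha, beta, n_paths=500, seed=None):
    # Sample possible F1 scores (single constraint) from the GP posterior:
    # mean/cov are the posterior moments of Eq. (5) at the test points.
    rng = np.random.default_rng(seed)
    pred_feasible = (alpha <= mean) & (mean <= beta)        # A_mu of Eq. (6)
    paths = rng.multivariate_normal(mean, cov, size=n_paths)
    scores = []
    for f in paths:                                         # each path is one possible f
        true_feasible = (alpha <= f) & (f <= beta)
        tp = np.sum(pred_feasible & true_feasible)
        denom = np.sum(true_feasible) + np.sum(pred_feasible)
        scores.append(2.0 * tp / denom if denom > 0 else 1.0)  # degenerate case: empty sets
    return np.array(scores)
```

The 5th percentile of the returned scores (e.g., np.percentile(scores, 5)) is then used as the conservative point estimate in the stopping rule of Fig. 2.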

In case the feasible region is defined by multiple (k in total) black box constraints, \(A := \{{\varvec{x}} \in {\mathcal {X}} : \bigcap \nolimits _{j=1}^{k} \alpha _j< f_j({\varvec{x}})<\beta _j\}\), a histogram of sampled F1 scores could be evaluated for each of the constraints. However, this does not provide an overall measure of the prediction model’s accuracy.

Hence, an aggregated F1 score is proposed in the following. We first introduce some additional definitions to properly describe the criterion for multi-constraint problems from a practical implementation perspective. Let us define a set operator \(I_{{\varvec{A}}_\mu }(D)= D\cap \bigcap \nolimits _{j=1}^{k} A_{\mu _j}\), which returns the subset of elements of the input data \(D \in {\mathbb {R}}^{m \times d}\) that lie in the joint feasible region. Let \(I_{F_j}' [\cdot ]: {\mathbb {R}}^{m} \rightarrow \{0, 1\}^m\) denote whether function values are within the feasible interval of constraint j, while the operator \(\vert \cdot \vert _1\) returns the number of elements equal to 1. Assuming the same bounded design space \({\mathcal {X}}\) for all functions and using the same test data \(X^*\) for each of the constraints, the predictions of these models can be aggregated into a single final model by reformulating the F1 score as a random variable:

$$\begin{aligned} \text {F1}_{\varvec{\mu }} ({\varvec{f}}) \vert \varvec{\alpha }, \varvec{\beta }, X^* = 2\frac{ \vert \prod _{j=1}^k I_{F_j}' [f_j(I_{{\varvec{A_\mu }}}[X^*])] \vert _1}{\vert \prod _{j=1}^k I_{F_j}' [f_j[X^*]] \vert _1 + \vert X_F^* \vert } \end{aligned}$$
(20)

As this expression cannot be trivially decomposed, we can also sample its distribution as in Algorithm 2.

Algorithm 2 Sampling the aggregated F1 score for multiple constraints (pseudocode not reproduced)

The recall metric can also be sampled by following a similar procedure. Note that the proposed strategy is also applicable to any other suitable probabilistic model; in this research, we restrict our discussion to the framework of Gaussian processes.

Since the proposed F1-sample measure heavily depends on sampling from the Gaussian process defined in Eq. (5), it is worthwhile to describe the computational complexity of calculating the proposed stopping criterion. For the general multi-constraint case, the computational complexity in case of exact sampling (A.2 of [38]) is \({\mathcal {O}}(n^3Nk)\), where \(n=|X^*|\) is the number of test samples, N is the number of GP posterior path samples and k denotes the number of constraints. More efficient F1 score estimation is a topic of future research; we refer to [44,45,46] for further discussion.

To utilize the samples from Algorithms 1 and 2 as an indicator of the metamodel’s prediction accuracy, the sampled F1 scores (or recall) can be averaged as an unbiased estimator of the mean of Eq. (20), which is referred to as \(\overline{\text {F1-sample}}\). Meanwhile, as the F1-sample measure relies on the assumption that the real function f is generated by a GP (Proposition 1), in reality this assumption can result in either overestimation or underestimation of the real F1 score. Thus, to provide a more robust estimation, several different strategies can be adopted. For instance, instead of using a fixed test data set, it can be randomly regenerated for each F1 sample. Alternatively, one can employ a fully Bayesian approach by placing a prior distribution on the kernel hyperparameters and sampling the posterior GP to estimate the F1 score. Moreover, one can take a step further by sampling from a mixture of possible kernels (e.g., [47]). Nevertheless, such approaches result in considerably more computational effort within the current AL framework. Hence, in order to efficiently make use of the proposed F1-sample metric as a stopping criterion, as shown in Fig. 2, a conservative approach can be adopted by checking when the metamodel is 95% confident that its predicted F1 score is larger than an acceptable F1 score, i.e., by using the 5th percentile of the F1-sample distribution as a point estimate. Once this condition holds for several consecutive iterations (5 iterations are used in this paper), the AL process is stopped.
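
A minimal sketch of this stopping check is given below; the acceptable score and the number of consecutive iterations (patience) are the user-chosen values described above:

```python
def should_stop(f1_fifth_percentile_history, acceptable_f1, patience=5):
    # Stop when the 5th percentile of the sampled F1 scores has exceeded the
    # acceptable score C for `patience` consecutive iterations (Fig. 2).
    recent = f1_fifth_percentile_history[-patience:]
    return len(recent) == patience and all(p >= acceptable_f1 for p in recent)
```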

4 Numerical examples and application examples

First, synthetic experiments are conducted to validate the \(\overline{\text {F1-sample}}\) metric, i.e., to investigate to what extent it is able to capture the real F1 score. Afterwards, applications to Electronic Design Automation (EDA) examples are presented, where feasible region identification is of practical interest during design space exploration. In particular, an antenna and a distributed filter are studied, both operating in the microwave spectrum. Different design constraints are applied to their frequency responses, simulated using ADS Momentum [48]. In the first EDA example, the feasible region is defined by a single constraint on the device Return Loss (RL). In the second EDA example, a feasible region determined by multiple constraints must be identified, based on the filter operating bandwidth. Both geometrical parameters and material properties are used as design parameters, so that microwave designers can find feasible solutions for different manufacturing processes or layout techniques.

All the experiments are conducted on a computer with an Intel(R) Core(TM) i7-8650U @ 1.9 GHz processor and 16 GB RAM. The AL framework is implemented using GPFlowOpt [49]. Regarding the GP hyperparameters, the mean function m is chosen equal to zero, while the automatic relevance determination (ARD) [38] version of the squared exponential (SE) kernel is employed. The kernel hyperparameters are determined using maximum likelihood estimation with an L-BFGS-B optimizer. For the test data \(X^*\) of the F1-sample metric, we use 10,000 random samples. Then, to calculate the F1-sample score, \(N=500\) function samples are drawn from the posterior GP. Finally, 20 experiments are conducted independently to provide robust results.

4.1 Numerical examples

The numerical experiments are conducted on seven different benchmark functions, described in Table 1 along with their feasible regions. To give an indication of the difficulty of the problems studied here, the feasible ratio and the feasible variance ratio are also reported in Table 1. The feasible region ratio indicates the ratio between the volumes of the feasible region and the total bounded design space, and it is defined as:

$$\begin{aligned} R_\mathrm{{fr}} = V(A)/ V({\mathcal {X}})\ (\%), \end{aligned}$$
(21)

where the operator \(V(\cdot )\) calculates the volume of a specified region. The feasible variance ratio is the function variance within the feasible region over the variance in the design spaces, and is defined as:

$$\begin{aligned} \begin{aligned}&R_\mathrm{{fv}} = \varDelta \{f \vert A\}/\varDelta \{f \vert {\mathcal {X}}\} \ (\%)\\&\text {where}\ \varDelta \{f \vert {\mathcal {X}}\} = \underset{{\varvec{x}} \in {\mathcal {X}}}{\text {max}} f - \underset{{\varvec{x}} \in {\mathcal {X}}}{\text {min}} f\\&\quad \quad \ \ \ \varDelta \{f \vert A\} = \underset{{\varvec{x}} \in A}{\text {max}} f - \underset{{\varvec{x}} \in A}{\text {min}} f \end{aligned} \end{aligned}$$
(22)
Table 1 Benchmark functions settings

4.1.1 Assessment of stopping criterion on random distributed data

The accuracy of the F1 score estimated by \(\overline{\text {F1-sample}}\) is assessed by training a one-shot GP model on random data sets with an increasing number of samples T. Adopting random samples to build the GP makes it difficult to accurately model the feasible region, especially in cases where the feasible ratio is small: it might happen that no point is located in the feasible region. This allows a fair evaluation of the F1-sample score even when the metamodel is less confident. The F1-sample score is compared against the deterministic F1 score estimated on a random test data set of 10,000 samples.

As a comparison, we also investigate the performance of the IMSE\(_T\), which has widely been applied as an accuracy measurement of the feasible region within the design space:

$$\begin{aligned} \text {IMSE}_T = \int _{\mathcal {X}} s_k^2({\varvec{x}}) W({\varvec{x}})\mathrm{{d}}{\varvec{x}} \end{aligned}$$
(23)

where the so-called interval weight \(W({\varvec{x}}) = \varPhi (\frac{\beta - \mu ({\varvec{x}})}{\sigma ({\varvec{x}})}) - \varPhi (\frac{\alpha - \mu ({\varvec{x}})}{\sigma ({\varvec{x}})})\) (Eq. 21 of [25]) is the probability that the response at \({\varvec{x}}\) lies inside the interval.
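
A Monte Carlo sketch of Eq. (23) is shown below, assuming posterior means and standard deviations evaluated at uniformly drawn test points; the estimate equals the integral up to the constant volume \(V({\mathcal {X}})\):

```python
import numpy as np
from scipy.stats import norm

def imse_t(mu, sigma, alpha, beta):
    # Monte Carlo estimate of IMSE_T (Eq. 23): the posterior variance weighted by the
    # probability of lying inside [alpha, beta], averaged over uniform test points.
    W = norm.cdf((beta - mu) / sigma) - norm.cdf((alpha - mu) / sigma)
    return np.mean(sigma**2 * W)
```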

To evaluate the performance of the metric estimation, two criteria are considered. First, a good prediction accuracy measurement should follow the same trend as the evaluation metric computed on test data, which is the F1 score in our setting. This can be assessed via the linear correlation between the stopping criterion and the F1 score. Moreover, the deviation of the estimation with respect to the F1 score should be sufficiently small, since either overestimating or underestimating the F1 score will provide misleading results to designers.

Table 2 Benchmark results of random sampling
Fig. 3 Deviation progress plot of random sampling: the deviation of the averaged stopping criteria versus the F1 score is shown as the F1 score increases; a positive deviation means an overestimation of the F1 score, a negative deviation means an underestimation

Results in the form of linear correlation are provided in Table 2. The deviation with respect to the F1 score is also provided in Fig. 3. This progress plot is constructed by linearly interpolating the F1 score versus the F1 score predicted by the two metrics. Since the metamodel is built by random sampling, all the experiments are independent, hence the F1 score prediction is only reported with its mean. One immediate observation is that the IMSE\(_T\) metric, while it also measures the metamodel’s accuracy, cannot be used to describe the F1 score directly, as it is unscaled and negatively correlated with the F1 score. Thus, a scaled IMSE\(_T\) is used here for a fair comparison, denoted as \(\text {IMSE}_T^+:=1 - \text {IMSE}_T/\max (\text {IMSE}_T)\). Note that this comparison is not performed for the Curved benchmark function due to the small volume of its feasible region, which would require an unaffordable number of samples to provide an accurate metamodel. For all the other benchmark functions, it can be seen that the proposed \(\overline{\text {F1-sample}}\) clearly provides a better correlation with the F1 score than \(\text {IMSE}_T^+\). Looking at the deviation progress plot in Fig. 3, both stopping criteria converge when the model accuracy improves, i.e., when the F1 score approaches 1. Nevertheless, \(\overline{\text {F1-sample}}\) has a smaller deviation for almost all the processes.

Fig. 4 Progress plot of the active learning process: the deviation of the stopping criteria versus the F1 score is shown. A positive deviation means an overestimation of the F1 score; a negative deviation means an underestimation. The shaded areas with black edges represent 2\(\sigma\) of the \(\overline{\text {F1-sample}}\) metric

4.1.2 Assessment of stopping criterion in active learning

To further demonstrate the advantage of F1-sample in the AL scheme, experiments for active feasible region identification are conducted, now based on the proposed entropy-based acquisition function. The initial data are generated by a max–min Latin hypercube design (LHD). The starting and stopping sample numbers of the experiments, as well as the corresponding results, are reported in Table 3. For some benchmark functions, like the Goldstein-Price function, low correlation is shown by both metrics. The main reason is that the feasible region is especially difficult to model in this case, as the function variance is considerably larger than the variance within the specified feasible region. Additionally, in earlier stages the metamodel accuracy is not high enough for the criterion to closely follow the F1 score trend, which changes along with the learning steps. This issue gradually disappears as the model accuracy grows.

As a complementary benchmark, a multi-constraint experiment is executed. Since there is no trivial extension of IMSE\(_T\) to this setting, it is not compared here and is reported as not applicable (NA). Results with random sampling and active learning are also provided in Table 3 and Fig. 4. A high correlation can be observed consistently.

Table 3 Benchmark results of active learning

4.2 Design of a microstrip patch antenna

Fig. 5 Microstrip patch antenna geometry and design parameters

We now demonstrate the usage of the proposed AL framework in the electronic design of a microstrip antenna, which consists of a metal patch on a dielectric substrate, with a ground metal layer on the back side. The patch geometry is shown in Fig. 5, while the design parameters can be found in Table 4. The feed line for the signal is matched with a 50 \(\Omega\) impedance at 2.4 GHz. The antenna is simulated in the range [2–3] GHz with ADS Momentum, and a typical simulation requires about 45 seconds. The constraint function is the frequency for which the Return Loss (RL) is minimal: if this value lies in [2.35, 2.45] GHz, the design is considered feasible. Mathematically, the problem is formulated as follows:

$$\begin{aligned} \begin{aligned}&\text {Find}\ A = \{{\varvec{x}} \in {\mathcal {X}} \vert I_A(f({\varvec{x}})) =1 \}\\&\text {where}\ I_A(f({\varvec{x}}))= {\left\{ \begin{array}{ll} 1&{} 2.35 \le f({\varvec{x}}) \le 2.45 \\ 0&{} \text {otherwise} \end{array}\right. }\\&\quad \quad \quad f({\varvec{x}}):= \underset{\zeta \in [2,3]}{\text {argmin}}({\text {RL}}_{{\varvec{x}}}(\zeta )) \end{aligned} \end{aligned}$$
(24)

where \({\text {RL}}_{{\varvec{x}}}(\cdot )\) denotes the RL curve for a given design \({\varvec{x}}\), and \(f({\varvec{x}})\) denotes the frequency \(\zeta\) in the range [2, 3] GHz at which the RL curve attains its minimum.
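
For clarity, the mapping from a simulated RL curve to the scalar constraint value of Eq. (24) could be implemented along the following lines; the simulator call itself is omitted and the array names are illustrative:

```python
import numpy as np

def resonance_frequency(freqs_ghz, rl_db):
    # f(x) of Eq. (24): the frequency at which the simulated Return Loss is minimal.
    return freqs_ghz[np.argmin(rl_db)]

def antenna_feasible(freqs_ghz, rl_db, lo=2.35, hi=2.45):
    # Feasibility indicator for the design whose RL curve is given.
    return lo <= resonance_frequency(freqs_ghz, rl_db) <= hi
```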

Table 4 Microstrip patch antenna design parameters range
Fig. 6 Progress plot of active feasible region identification for the microstrip patch antenna. Sampled F1 histograms are shown at different iterations, where the vertical dashed line denotes the 5th percentile of the histogram

An initial GP model is built using 30 initial samples, which are generated by LHD. The progress plot of the predicted F1 score is shown in Fig. 6. It can be seen that the F1-sample keeps increasing with the number of iterations. The sampled F1 score histogram with its 5th percentile is also shown at three different stages of the AL process. By applying the proposed stopping criterion, the six-dimensional AL process stopped after 71 iterations, and the whole AL process lasted about 62 minutes.

To evaluate the performance of the final learned GP, 3000 antenna configurations are randomly generated in the design space and the final GP metamodel is adopted to predict their feasibility. Only 372 have been predicted as feasible. We evaluated the performance of these predicted feasible designs with the ADS Momentum simulator; the resulting RL curves are shown in Fig. 7. The final learned GP metamodel has a precision score of 98.12% and an F1 score of 96.69%.

Fig. 7 Microstrip patch antenna: RL of the 372 designs predicted as feasible by the metamodel. Among these, 365 are true positives (in green), while 7 are false positives (in red). Besides, 18 of the 2628 designs predicted as infeasible are false negatives. Note that both the false positives and false negatives are near the edge of the feasible region

4.3 Design of a double-folded microstrip band-stop filter

Fig. 8 Double-folded microstrip band-stop filter geometry and design parameters

The second application example is a double-folded microstrip band-stop filter [53], formed by a dielectric substrate between a top metallization and a bottom ground plane. As illustrated in Fig. 8, the geometry of the upper layer consists of two stubs, with identical length and spacing, folded onto the two sides of a transmission line. Here, the proposed AL framework can be applied to identify all possible parametric configurations for which the stop band presents the desired width and location. Hence, the feasible region is defined by the following conditions:

$$\begin{aligned}&\mathrm{{BW}} \le 7\,\text {GHz}; \quad 13.5 \le f_0 \le 14.5\, \text {GHz} \end{aligned}$$
(25)
$$\begin{aligned}&\mathrm{{BW}} = f_{H} - f_{L} ; \quad f_0 = \frac{f_H + f_L}{2} \end{aligned}$$
(26)

where BW is the bandwidth and \(f_0\) is the central frequency. The filter response is computed at 101 frequency points over the frequency range [5–25] GHz. Then, BW and \(f_0\) are defined w.r.t. the higher and lower band-stop frequencies (\(f_H\), \(f_L\)), which correspond to a −3 dB magnitude response.
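
One possible way to extract \(f_L\), \(f_H\) and hence BW and \(f_0\) (Eqs. 25–26) from the simulated magnitude response is sketched below; it assumes the stop band is the contiguous frequency range where the response drops below −3 dB, and the array names are illustrative:

```python
import numpy as np

def bandstop_bw_f0(freqs_ghz, s21_db, level_db=-3.0):
    # Bandwidth and central frequency (Eqs. 25-26) from the -3 dB crossings of |S21|,
    # assuming a single contiguous stop band in the simulated frequency range.
    below = np.where(s21_db <= level_db)[0]
    if below.size == 0:
        return 0.0, float("nan")                # no stop band found
    f_l, f_h = freqs_ghz[below[0]], freqs_ghz[below[-1]]
    return f_h - f_l, 0.5 * (f_h + f_l)

def filter_feasible(bw, f0):
    # Feasibility conditions of Eq. (25).
    return bw <= 7.0 and 13.5 <= f0 <= 14.5
```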

Table 5 Double-folded microstrip band-stop filter design parameters range

The design space has four parameters, as shown in Table 5, and the evaluation of a single design configuration takes about 42 s. The initial GP model is built upon 10 initial samples from an LHD and updated by the proposed entropy search-based method. The corresponding progress plot is depicted in Fig. 9, showing a rapid convergence of the F1-sample score for this multi-constraint problem. The AL process is halted when F1-sample remains larger than 0.98 for 5 consecutive iterations. The overall AL process takes about 14 minutes.

The accuracy of the computed metamodel is verified by comparison with 1000 ADS simulations. The results are shown in Fig. 10, indicating a precision score of 98.22% and an F1 score of 98.08%.

Fig. 9 Progress plot of active feasible region identification for the double-folded microstrip band-stop filter. Sampled F1 histograms are shown at different iterations, where the vertical dashed line denotes the 5th percentile of the histogram

Fig. 10 Double-folded microstrip band-stop filter: the magnitude of the element \(S_{21}\) of the scattering matrix (representing the frequency response of the filter) for the 338 designs predicted as feasible by the metamodel out of 1000 randomly distributed designs. Among these, 332 are true positives (in green), while 6 are false positives (in red). Besides, 7 of the 668 designs predicted as infeasible are false negatives

5 Conclusion

An information-theoretic based active learning method has been introduced to sequentially learn an accurate Gaussian process metamodel for generating feasible design candidates. In particular, feasible regions can be identified with an efficient, robust and automated procedure, where multiple constraints can be specified at the same time. Furthermore, a novel AL stopping criterion has been proposed, based on a probabilistic description of the prediction accuracy. The proposed framework has been applied to two real-world electromagnetic design problems: a six-dimensional design space with a single constraint, and a four-dimensional problem with two constraints. In both applications, the method is able to reach a high-precision representation. Moreover, the number of samples needed to compute the model is minimized thanks to the novel sampling strategy, as well as to the stopping criterion, which allows online monitoring of the metamodel’s belief about the feasible region and thus stops the algorithm efficiently. This indicates that the model is suitable to simulate the behaviour of the real system on a confined region, and it can be successfully employed to generate a large number of good design candidates in complex engineering problems.

There are still several limitations to the proposed algorithm. For the F1-sample itself, the stopping criterion depends on numerical integration over the input domain, which makes it difficult to apply in high-dimensional problems. Moreover, the sampling process used to approximate the distribution requires additional computation. Future research will focus on overcoming these limitations.