1 Introduction

Supervised machine-learning techniques have been re-emerging as a promising avenue for data-driven modeling in various engineering disciplines (Venkatasubramanian 2019). In most applications, the overall goal is optimal decision-making based on available data and a priori knowledge. Thus, data-driven models and mechanistic models are often combined to form hybrid models (Mogk et al. 2002; Kahrs and Marquardt 2008; Von Stosch et al. 2014; Glassey and Von Stosch 2018). Subsequently, hybrid models are frequently used in model-based optimization of the design and/or operation of processes (McBride and Sundmacher 2019; Schweidtmann and Mitsos 2019).

A critical issue of data-driven models is their limited extrapolability. Unless strong assumptions are imposed on the learned function, data-driven models can only be valid in regions with sufficiently dense coverage of training data points. We refer to this region as the validity domain (Leonard et al. 1992; Courrieu 1994). Consequently, there is a need to avoid the evaluation of data-driven models outside their validity domain during optimization. Note that we refer to the validity domain of individual data-driven models throughout this work, but the concept can also be applied to hybrid models (Kahrs and Marquardt 2007).

The vast majority of previous publications use box constraints (i.e., hyperrectangles) to bound the inputs of data-driven models, i.e., each variable has independent bounds. This approach is practical when the training data is obtained from simulations based on regular grids or Latin hypercubes that are sufficiently dense. It is also advantageous for local and global optimization. However, it requires a priori known bounds and the possibility to obtain training data for any input combination. In practice, simulations can fail (Asprion 2020) and industrial process data usually does not cover hyperrectangular spaces (Asprion et al. 2019). Manually selected box bounds are therefore prone to be wrong: they may cut off optimal solutions or overestimate the validity domain.

As proposed by Courrieu (1994), a few previous works in process systems engineering (PSE) constructed the convex hull of the training data points to describe the validity domain and integrated it as a set of linear constraints in optimization problems (Kahrs and Marquardt 2007; Zhang et al. 2016; Asprion et al. 2019). By definition, the convex hull is the smallest convex set that contains all data points. Commonly, evaluations of data-driven models inside the convex hull of the training data are called interpolation and evaluations outside it extrapolation. However, roughly speaking, the convex hull cannot account for potential holes in the training data set, gaps between separated clusters of data, or nonconvex boundaries. Thus, staying within the convex hull appears to be only a necessary condition for the validity of data-driven models, not a sufficient one. Identifying whether the convex hull is a suitable model for the data domain is very challenging in high dimensions. Notably, Zhang et al. (2016) extended the convex hull method to the union of multiple polytopes by introducing binary variables and additional constraints to the problem. However, this algorithm becomes impractical when the number of data points or their dimension is very high. Besides convex hull formulations, several publications circumvent extrapolation problems en passant in different ways. Inspired by earlier works by Leonard et al. (1992) and Simutis et al. (1995), Teixeira et al. (2006) compute clusters using a k-nearest neighbor algorithm and constrain the distance to the cluster centers. Likewise, Rall et al. (2019) constrain the maximal allowed distance from the nearest training data point, resulting in a nonsmooth optimization problem. Mistry et al. (2018) penalize deviations from the training data mean in a space that is parameterized using principal component analysis. Kumar et al. (2019) train multiple data-driven models on a design problem and reject designs where the variation between the models is large. Similarly, Pinto et al. (2019) use bootstrap aggregation to estimate error bounds for hybrid mechanistic/data-driven models. There exist further methods that quantify the variance or confidence interval of predictions, such as Bayesian methods and maximum likelihood estimations (Papadopoulos et al. 2001). However, these lead to chance-constrained programming problems (Charnes and Cooper 1959; Schweidtmann et al. 2020a). Likewise, there are a few studies on the adaptive exploration of the design space (Teixeira et al. 2006; Larson and Mattson 2012; Chen et al. 2018; Knudde et al. 2019) and related works on constrained Bayesian optimization (Shahriari et al. 2016). Moreover, there exist closely related contributions for the (adaptive) identification of process feasibility and flexibility using data-driven approaches that generate a single feasibility function (Boukouvala and Ierapetritou 2012; Bhosekar and Ierapetritou 2018). However, we focus on fixed training data sets in this study; the extension of our method to adaptive sampling is a promising direction for future research.

An alternative to box constraints and convex hulls is a nonlinear classifier that can also model complicated validity domains. A few previous studies in mechanical engineering (Malak and Paredis 2010; Roach et al. 2011) used Support Vector Domain Description (SVDD) (Tax and Duin 1999) to model the validity domain of data-driven equipment models. Also, Quaglio et al. (2018) use binary support vector classification to include reliability constraints in model-based design of experiments. As only valid training data points are given in most engineering applications, we consider one-class classification in this work. One-class classification is an unsupervised machine learning technique. There exists a broad variety of one-class classifiers that can be divided into density methods, boundary methods, and reconstruction methods (Tax 2001). Also, one-class classification is closely related to novelty, outlier, or anomaly detection (Chandola et al. 2009; Pimentel et al. 2014; Khan and Madden 2009, 2014; Ding et al. 2014). The previous literature indicates that one-class support vector machines (SVMs) (Schölkopf et al. 2000) are common and suitable for the problem at hand. Compared to density models, less training data is required to construct the boundary, since only the boundary is estimated and not a complete density distribution (Tax 2001). In addition, the one-class SVM is tolerant to outliers in the training set (Pimentel et al. 2014).

Optimization problems with one-class SVMs embedded are nonconvex. Thus, deterministic global optimization is desirable to identify global solutions. However, these models lead to large-scale optimization problems that are difficult to solve. In our previous work, we showed that a reduced-space (RS) formulation and the use of McCormick relaxations are advantageous for the optimization of two other important classes of data-driven models, namely artificial neural networks (ANNs) (Schweidtmann and Mitsos 2019) and Gaussian processes (GPs) (Schweidtmann et al. 2020a). We propose a similar idea here for one-class SVMs.

The global shape of data matters because it often provides important information about the underlying phenomena represented by the data. Especially in high-dimensional data, topological data analysis (TDA) can reveal and quantify objects and features not directly visible to the human eye. In the context of this work, it provides valuable information about the topology of the training data that can be colloquially thought of as holes or separated clusters. TDA was initiated relatively recently (Letscher et al. 2002; Zomorodian and Carlsson 2005). Its roots lie in applied (algebraic) topology and computational geometry (Chazal and Michel 2017) and it is commonly used to account for higher-order interactions in data, to comprehend mesoscale structures, or to compare different data spaces (Patania et al. 2017). The most common TDA method is persistent homology (Wasserman 2018). So far, there are only a few applications of persistent homology in the fields of (bio-)chemical engineering and material science (Hiraoka et al. 2016; Saadatfar et al. 2017; Xia 2018; Xia et al. 2019; Smith et al. 2020).

We propose a three-step approach to model the validity domain of data-driven models for optimization. We first perform TDA using persistent homology. In case clusters or holes are identified, we train a one-class SVM on the training data domain of the data-driven models and encode it as constraints in the subsequent process optimization. Otherwise, we construct the convex hull of the data and encode it as constraints. We finally perform deterministic global process optimization with the data-driven models and their respective validity constraints. To ensure computational tractability, we develop a RS formulation for trained one-class SVMs. Moreover, we employ convex and concave envelopes of kernel functions to accelerate optimization. We demonstrate the potential of our method on a set of illustrative mathematical case studies and an engineering case study, i.e., the open-loop control of a sulfur recovery unit.

2 Methodology

As illustrated in Fig. 1, we propose a three-step approach to obey validity limits of data-driven models during optimization. In the first step, we conduct a TDA of the training data. In the second step, we either construct the convex hull of the data or we train a one-class classifier, i.e., a SVM. In the third step, we embed the trained classifier or convex hull in the optimization problem and solve it to global optimality. All described methods are available open-source. We perform the TDA with the Python toolbox Ripser.py, which is available under the MIT license (Tralie et al. 2018). The training of the one-class SVM is performed by Scikit-learn (Pedregosa et al. 2011) and the convex hulls are identified using SciPy (Virtanen et al. 2020). We provide the one-class SVM within the “MeLOn - Machine Learning Models for Optimization” toolbox under the Eclipse public license (Schweidtmann et al. 2020b). The resulting optimization problems are solved using our open-source global solver MAiNGO (Bongartz et al. 2018).
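
For orientation, the three steps can be sketched in a few lines of Python with the aforementioned toolboxes; the training inputs and the hyperparameter values below are illustrative placeholders, and the hand-off to MeLOn/MAiNGO is only indicated in a comment.

```python
import numpy as np
from ripser import ripser
from scipy.spatial import ConvexHull
from sklearn.svm import OneClassSVM

# Placeholder training inputs (N x D); in practice, the inputs of the data-driven model.
X = np.random.default_rng(42).uniform(-1.0, 1.0, size=(200, 2))

# Step 1: TDA of the training inputs (dgms[0]: H0 components, dgms[1]: H1 holes).
dgms = ripser(X, maxdim=1)["dgms"]

# Step 2: model the validity domain. If the TDA indicates holes or separated
# clusters (Sect. 2.1), a one-class SVM is trained; otherwise the convex hull
# is constructed. Both variants are shown; nu and gamma are illustrative values.
svm_domain = OneClassSVM(kernel="rbf", nu=0.03, gamma=0.5).fit(X)
hull_domain = ConvexHull(X)

# Step 3: the fitted classifier or the hull facets are embedded as constraints
# in the optimization problem via MeLOn and solved with MAiNGO (not shown here).
```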

Fig. 1: Overview of the proposed three-step methodology to obey validity limits of data-driven models in optimization

2.1 Topological data analysis using persistent homology

In persistent homology, we are interested in so-called topological invariants, i.e., properties that are invariant under homeomorphisms. The topological invariants of interest are the homology groups H\(_k\) of dimension k, with \(\beta _k = \dim (H_k)\) being the Betti numbers (Binchi et al. 2014; Chung et al. 2015). “Informally, \(\beta _0\) is the number of connected components, \(\beta _1\) is the number of two-dimensional holes or “handles” and \(\beta _2\) is the number of three-dimensional holes or “voids” etc.” (Binchi et al. 2014).

The topological invariants are computed by representing the original data set, i.e., a point cloud, as a simplicial complex through a simplicial filtration. We utilize the common Vietoris-Rips filtration, where an n-simplex is formed in the simplicial complex if and only if the pairwise distance between all points of the n-simplex is at most \(\epsilon \). At the bottom of Fig. 2, we show a series of simplicial complexes for an illustrative point cloud.
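
For illustration, the following minimal Python sketch computes the Vietoris-Rips persistent homology of a small point cloud with Ripser.py; the point cloud itself is an arbitrary example (ten noisy points on a circle).

```python
import numpy as np
from ripser import ripser

# Ten noisy points on a circle: one expected connected component and one hole.
rng = np.random.default_rng(0)
theta = np.linspace(0.0, 2.0 * np.pi, 10, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])
X += 0.05 * rng.standard_normal(X.shape)

# Vietoris-Rips filtration up to homology dimension 1.
dgms = ripser(X, maxdim=1)["dgms"]
print("H0 (birth, death) pairs:\n", dgms[0])  # connected components
print("H1 (birth, death) pairs:\n", dgms[1])  # two-dimensional holes
```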

Fig. 2: Illustration of a Vietoris–Rips filtration utilized for persistent homology. The upper part shows the data set and circles around the data points with increasing diameter \(\epsilon \). The bottom image illustrates the simplicial complexes formed during the filtration. The figure is based on Kimura and Imai (2017)

Persistent homology studies topological invariants that persist over multiple length scales (\(\epsilon \)) in the data (Chambers and Letscher 2018; Otter et al. 2017; Xia 2018; Xia et al. 2019). In other words, we examine the lifespan of topological invariants by increasing \(\epsilon \) incrementally and constructing simplicial complexes. At the bottom of Fig. 2, we can observe that \(\beta _0=10\) connected components (H\(_0\)) exist at \(\epsilon _1\), \(\beta _0=5\) connected components exist at \(\epsilon _2\), \(\beta _0=1\) connected component and \(\beta _1=1\) two-dimensional hole (H\(_1\)) exist at \(\epsilon _3\), and \(\beta _0=1\) connected component exists at \(\epsilon _4\).

The results of persistent homology can be depicted in barcode diagrams or persistent diagrams. We use the more common persistent diagrams in this work. The birth and death coordinates of the homology groups in the example are shown in the persistent diagram in Fig. 3. The x-axis represents the \(\epsilon _{\text {birth}}\) distance and the y-axis the \(\epsilon _{\text {death}}\) distance of the H\(_0\) and H\(_1\) homology groups. Features with a long lifespan correspond to points far from the diagonal (Wasserman 2018). The blue triangle at the bottom left corner of the plot corresponds to the merge of two very close data points at small \(\epsilon \). The blue triangles with \(\epsilon _{\text {death}}\) between 1 and 1.5 in Fig. 3 correspond to the merges of connected components between \(\epsilon _2\) and \(\epsilon _3\) in Fig. 2, describing the decrease of the Betti number \(\beta _0\) from 5 to 1. The highest blue triangle indicates the one connected component that persists indefinitely. The red circle represents the homology group H\(_1\) and marks the birth and death of the two-dimensional hole, which is formed around \(\epsilon _3\) and dies before \(\epsilon _4\) in Fig. 2.

Fig. 3: Persistent homology plot of the illustrative point cloud. The x-axis shows the birth and the y-axis the death of the homology groups

In this example, the persistent diagram shows that a hole exists, providing useful insight to guide the decision process for model selection. For example, the \(\epsilon _{\text {death}}\) of the H\(_0\) components provides information about the data density. In the example, the last H\(_0\) components merge at an \(\epsilon \) of at most 1.5. This is significantly smaller than the \(\epsilon _{\text {death}}\) of the H\(_1\) hole. In other words, the lifespan of the hole is characteristic for the data set. Thus, persistent homology can provide valuable information about the topology of the training data of data-driven models. In particular, it can identify holes or separated data clusters. However, it cannot differentiate clearly between convex and nonconvex boundaries (see also the illustrative examples in Sect. 3).
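
This comparison of lifespans can be automated. The following sketch implements one possible heuristic, which is our own illustration and not a prescribed rule: the data set is flagged for one-class classification if an H\(_1\) feature lives longer than the scale at which the last H\(_0\) components merge.

```python
import numpy as np

def needs_one_class_svm(dgms, factor=1.0):
    """Heuristic sketch: flag the data set for one-class classification if a
    hole (H1 feature) lives longer than the scale at which the last H0
    components merge. 'factor' is an assumed threshold, not a prescribed
    rule. Separated clusters could be flagged analogously via unusually
    large finite H0 death values."""
    h0_deaths = dgms[0][:, 1]
    merge_scale = h0_deaths[np.isfinite(h0_deaths)].max()  # last finite H0 merge
    if len(dgms[1]) == 0:
        return False  # no H1 features at all
    h1_lifespans = dgms[1][:, 1] - dgms[1][:, 0]
    return bool(h1_lifespans.max() > factor * merge_scale)
```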

2.2 Learn validity domain using one-class support vector machines

We model the validity domain using the convex hull or one-class SVM approach. The one-class SVMs are trained using the open-source python implementation in Scikit-learn (Pedregosa et al. 2011) and the convex hulls are identified using the open-source implementation in SciPy (Virtanen et al. 2020). The details of the one-class SVM are described in the following.

SVMs are a popular method for binary classification (Cortes and Vapnik 1995) and regression (Smola and Schölkopf 2004). One-class SVMs are a modification of these classical SVMs (Schölkopf et al. 2000) (c.f. Tax (2001) on the similarity to SVDD). The goal is to learn the boundary of a given set of training points \({X} = \{\hat{\varvec{x}}^{(1)}, \ldots ,\hat{\varvec{x}}^{(i)},\ldots , \hat{\varvec{x}}^{(N)}\}\) with \(\hat{\varvec{x}}^{(i)} \in \mathbb {R}^{D}\). Similar to classical SVMs, the data is mapped to a high-dimensional feature space by \(\phi : \mathbb {R}^{D} \mapsto \mathbb {R}^{d}\) with \(d \gg D\), and the problem is later solved in the dual formulation using the kernel trick (Schölkopf 2001). In the feature space, a maximum-margin hyperplane is found that separates the data from the origin by solving:

$$\begin{aligned} \min _{\varvec{w}, \varvec{\xi }, \rho } \quad \frac{1}{2} \Vert \varvec{w}\Vert ^2 + \frac{1}{\nu N} \sum _{i=1}^{N} \xi _i - \rho \end{aligned}$$
(1)
$$\begin{aligned} \text {s.t.} \quad \varvec{w}^T \phi (\hat{\varvec{x}}^{(i)}) \geqslant \rho - \xi _i, \quad i = 1, \ldots , N, \end{aligned}$$
(2)
$$\begin{aligned} \xi _i \geqslant 0, \quad i = 1, \ldots , N, \end{aligned}$$
(3)

where \(\nu \in (0,1) \) is a regularization hyperparameter, \(\xi _i \in \mathbb {R}\) are slack variables, and \(\varvec{w}\) and \(\rho \) are the parameters of the hyperplane in high-dimensional feature space. Schölkopf et al. (2000) show that \(\nu \) is an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors in the training set, which is known as the \(\nu \)-property. The decision function \(f_{\text {DF-P}}(\varvec{x}) = \varvec{w}^T\phi (\varvec{x})-\rho \) is positive if a candidate point \(\varvec{x}\) is classified to be within the training data domain and negative if not. The dual of (1)-(3) is the quadratic program:

$$\begin{aligned} \min _{\varvec{\alpha }} \quad \frac{1}{2} \sum _{i=1}^{N} \sum _{j=1}^{N} \alpha _i \alpha _j K(\hat{\varvec{x}}^{(i)}, \hat{\varvec{x}}^{(j)}) \end{aligned}$$
(4)
$$\begin{aligned} \text {s.t.} \quad 0 \leqslant \alpha _i \leqslant \frac{1}{\nu N}, \quad \sum _{i=1}^{N} \alpha _i = 1, \end{aligned}$$
(5)

where \(\alpha _i\) are dual variables and \(K(\cdot ,\cdot )\) is a kernel function. Often, the radial basis kernel function \(K(\varvec{x},\varvec{y}) = \exp (-\gamma \Vert \varvec{x}-\varvec{y}\Vert ^2)\) with hyperparameter \(\gamma \) is used, as it has been shown to model the most complex boundaries best (Tax 2001). It holds that \(\alpha _i = 0 \) for training samples inside the learned boundary and \(\alpha _i > 0 \) for samples on or outside the boundary. Samples for which \(\alpha _i > 0\) are called support vectors. The decision function in the dual variables is given by \(f_{\text {DF-D}}(\varvec{x}) = \sum _{i \in I_{\text {sv}}} \alpha _i K(\hat{\varvec{x}}^{(i)}, \varvec{x}) -\rho \), where \(I_{\text {sv}}\) denotes the indices of the support vectors in the training data (i.e., the data points with corresponding \(\alpha _i > 0\)). To obey validity limits of data-driven models in an optimization problem, the following inequality has to hold:

$$\begin{aligned} f_{\text {DF-D}}(\varvec{x}) \geqslant 0 . \end{aligned}$$
(6)
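
For illustration, the following Python sketch fits a one-class SVM with Scikit-learn and rebuilds \(f_{\text {DF-D}}\) from the fitted attributes; the training data and the hyperparameter values are placeholders, and the comment on the stored intercept reflects Scikit-learn's sign convention.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Placeholder training inputs; nu and gamma are illustrative values.
X = np.random.default_rng(1).uniform(-1.0, 1.0, size=(300, 2))
gamma, nu = 0.5, 0.03
clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X)

# Rebuild f_DF-D(x) = sum_i alpha_i * K(x_hat_i, x) - rho from the fitted model.
sv = clf.support_vectors_        # support vectors x_hat_i, i in I_sv
alpha = clf.dual_coef_.ravel()   # dual variables alpha_i (libsvm scaling)
rho = -clf.intercept_[0]         # Scikit-learn stores -rho as the intercept

def f_df_d(x):
    k = np.exp(-gamma * np.sum((sv - x) ** 2, axis=1))  # RBF kernel values
    return float(alpha @ k - rho)

x_cand = np.array([0.2, -0.4])
assert np.isclose(f_df_d(x_cand), clf.decision_function([x_cand])[0])
print("candidate inside validity domain:", f_df_d(x_cand) >= 0.0)  # Inequality (6)
```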

The parameter \(\nu \) can be estimated from an outlier fraction by using the aforementioned \(\nu \)-property. This makes the method more tolerant to outliers in the training data (Pimentel et al. 2014). The hyperparameter \(\gamma \) controls the model complexity when the radial basis kernel is used. If a small \(\gamma \) is used, all samples are mapped to a small region in the feature space and the one-class SVM cannot distinguish well between the samples. In other words, the model lacks complexity. If \(\gamma \) is large, pairs of samples become nearly orthogonal in the feature space. This leads to overfitting and a high number of support vectors. A common approach to identify an appropriate \(\gamma \) is to gradually decrease \(\gamma \) until the number of support vectors does not decrease much further (e.g., Dreiseitl et al. 2010). However, automatically selecting an appropriate \(\gamma \) is challenging (e.g., Evangelista et al. 2007; Xiao et al. 2014a, b).
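
A minimal sketch of this incremental approach is given below; the grid of \(\gamma \) values and the plateau tolerance are our own assumptions and would need tuning in practice.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def select_gamma(X, nu=0.03, gammas=np.logspace(2, -2, 20), rel_tol=0.05):
    """Decrease gamma along a grid until the number of support vectors stops
    decreasing by more than rel_tol (sketch; grid and tolerance are assumed)."""
    n_sv_prev = None
    for gamma in gammas:  # np.logspace(2, -2, ...) yields a decreasing sequence
        n_sv = len(OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X).support_)
        if n_sv_prev is not None and (n_sv_prev - n_sv) <= rel_tol * n_sv_prev:
            return gamma  # support-vector count has plateaued
        n_sv_prev = n_sv
    return gammas[-1]
```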

2.3 Optimization with classifier as constraint

We consider a global optimization problem where a classifier is used to obey validity limits of a data-driven model. In most cases, the inputs of the classifier model correspond to the degrees of freedom \(\varvec{x}\) of the optimization problem. The classifier can determine if a given \(\varvec{x}\) is feasible or infeasible by evaluating its decision function \(f_{\text {DF-D}}(\cdot )\). To obey validity limits, Inequality (6) has to hold. Although the decision function is an explicit function, there exist different ways to formulate it in optimization problems. These problem formulations are equivalent as they have the same solution, but they can have a large impact on the computational performance of global optimization.

In the full-space (FS) formulation, a set of nonlinear equations is provided as equality constraints while the dependent (or intermediate) variables are optimization variables. Note that there exist multiple valid FS formulations depending on the equality constraints and optimization variables provided to the solver. One representative FS formulation for optimization with one-class SVMs embedded is shown in the following:

$$\begin{aligned} \min _{\varvec{x}, z_{\text {obj}}, \varvec{z}_{\text {dd}}, \varvec{z}_{K}, \varvec{z}_{r}} \quad z_{\text {obj}} \end{aligned}$$
(7)
$$\begin{aligned} \text {s.t.} \quad z_{\text {obj}} = f_{\text {obj}}(\varvec{x}, \varvec{z}_{\text {dd}}) \end{aligned}$$
(8)
$$\begin{aligned} \varvec{h}_{\text {dd}}(\varvec{x}, \varvec{z}_{\text {dd}}) = \varvec{0} \end{aligned}$$
(9)
$$\begin{aligned} \sum _{i \in I_{\text {sv}}} \alpha _i z_{K,i} - \rho \geqslant 0 \end{aligned}$$
(10)
$$\begin{aligned} z_{K,i} = \exp (-\gamma \, z_{r,i}), \quad i \in I_{\text {sv}}, \end{aligned}$$
(11)
$$\begin{aligned} z_{r,i} = \Vert \hat{\varvec{x}}^{(i)} - \varvec{x}\Vert ^2, \quad i \in I_{\text {sv}}, \end{aligned}$$
(12)

Herein, Eq. (7) minimizes the objective function value \(z_{\text {obj}}\), which is given by Eq. (8). Note that the objective depends on the variables of the data-driven model \(\varvec{z}_{\text {dd}}\), which are given by the solution of Eq. (9). The decision function of the one-class SVM is given by the inequality constraint (10), while its intermediate variables are given by the solution of Eqs. (11) and (12). This FS formulation has in total \(D+2\cdot \vert I_{\text {sv}}\vert +\dim (\varvec{z}_{\text {dd}})+ 1\) optimization variables, \(2\cdot \vert I_{\text {sv}}\vert +\dim (\varvec{z}_{\text {dd}})+ 1\) equality constraints, and one inequality constraint.

The equality constraints of the one-class SVM can be solved explicitly for the intermediate variables. Thus, we can directly formulate a RS formulation of the optimization problem (c.f. Bongartz and Mitsos 2017):

$$\begin{aligned} \min _{\varvec{x}} \quad f_{\text {RS}}(\varvec{x}) \end{aligned}$$
(13)
$$\begin{aligned} \text {s.t.} \quad f_{\text {DF-D}}(\varvec{x}) \geqslant 0 \end{aligned}$$
(14)

Herein, \(f_{\text {RS}}(\cdot )\) is the RS formulation of the data-driven model and objective function. Thus, Eq. (13) results from sequential substitutions of Eqs. (7)–(9). This is possible as most data-driven models such as ANNs or GPs are explicit functions (c.f. Schweidtmann and Mitsos 2019; Schweidtmann et al. 2020a) and as the objective function is a function of the degrees of freedom and the predictions of the data-driven model. Similarly, Eq. (14) results from the substitution of Eqs. (10)–(12). The RS formulation has only D optimization variables and one inequality constraint.
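
The sequential substitution behind Eqs. (13) and (14) can be illustrated as follows: the ANN forward pass and the SVM decision function are evaluated as explicit functions of \(\varvec{x}\) only, so the solver sees D variables and a single inequality constraint. The sketch below is generic Python, not the MeLOn implementation.

```python
import numpy as np

def ann_forward(x, weights, biases):
    """Explicit forward pass of an ANN with tanh hidden layers and a linear
    output layer (generic sketch, not the MeLOn implementation)."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = np.tanh(W @ z + b)
    return weights[-1] @ z + biases[-1]

def f_rs(x, weights, biases):
    # Eq. (13): the objective as an explicit composition; here simply the
    # ANN prediction itself, as in the case studies of Sect. 3.
    return ann_forward(x, weights, biases)

def g_rs(x, sv, alpha, rho, gamma):
    # Eq. (14): one-class SVM validity constraint g_rs(x) >= 0, with the
    # kernel terms substituted in explicitly (no intermediate variables).
    k = np.exp(-gamma * np.sum((sv - x) ** 2, axis=1))
    return alpha @ k - rho
```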

The convex hull of a point cloud with a finite number of points can be formulated as a set of linear inequality constraints \({X}_\text {convHull} = \{\varvec{x} \in \mathbb {R}^D~|~\varvec{A}\varvec{x}+\varvec{b} \le \varvec{0}\} \). Assuming that the convex hull has f facets, the matrix \(\varvec{A} \in \mathbb {R}^{f\times D}\) and the vector \(\varvec{b} \in \mathbb {R}^f\) collect the facet normals and offsets (Kahrs and Marquardt 2007). Thus, the FS and RS formulations of the convex hull are identical. Note that the data-driven model can still be formulated in either the RS or the FS formulation when using the linear convex hull constraints.
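
With SciPy, this facet description can be read directly from the fitted hull: the equations attribute stores one row per facet containing the outward normal and the offset, such that points inside the hull satisfy \(\varvec{A}\varvec{x}+\varvec{b} \le \varvec{0}\). A minimal sketch with placeholder data:

```python
import numpy as np
from scipy.spatial import ConvexHull

X = np.random.default_rng(2).uniform(-1.0, 1.0, size=(200, 2))  # placeholder data
hull = ConvexHull(X)

A = hull.equations[:, :-1]  # (f, D) outward facet normals
b = hull.equations[:, -1]   # (f,) offsets, so A @ x + b <= 0 inside the hull

def in_convex_hull(x, tol=1e-9):
    return bool(np.all(A @ x + b <= tol))

print(in_convex_hull(np.zeros(2)))  # the origin should lie inside this sample
```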

The RS formulation has three major advantages for global optimization: First, the problem formulation has a direct influence on the variables to be branched on. In the RS, the branch-and-bound (B&B) solver branches only on the degrees of freedom \(\varvec{x}\). In the FS, the B&B solver branches on the degrees of freedom and also on the intermediate variables. This is undesirable given the exponential worst-case runtime of global optimization methods. Note that this issue can also be mitigated by selective branching (Epperly and Pistikopoulos 1997). Second, the size of the subproblems that are solved during the optimization is affected by the problem formulation and the method for constructing relaxations. Our previous work shows that a combination of McCormick relaxations and the RS formulation can reduce the time to solve an iteration of the B&B solver significantly (Schweidtmann et al. 2020a; Bongartz 2020). Third, global optimization solvers usually require bounds on all optimization variables. Often, meaningful bounds are known for the degrees of freedom, but bounds on intermediate variables can be difficult to determine. Note that this problem is mitigated by some state-of-the-art solvers through automatic bound-tightening techniques.

The vast majority of previous literature approaches formulate global optimization problems in the FS because they frequently use modeling environments such as GAMS that essentially require an equation-oriented modeling approach. Recently, Hart et al. (2017) developed the Python-based optimization tool Pyomo, which allows modeling, the implementation of custom solvers, and access to multiple solvers. Pyomo allows both FS and RS formulations, and recently Hüllen et al. (2019) demonstrated RS optimization of ANNs in BARON through Pyomo. However, BARON relies on the auxiliary variable method for relaxations, which results in larger subproblems (Schweidtmann et al. 2020a). Thus, a RS formulation in BARON does not take full advantage of the RS formulation. In contrast, our open-source solver MAiNGO (Bongartz et al. 2018) relies on McCormick relaxations in the space of the original variables utilizing the library MC++ (Mitsos et al. 2009; Chachuat et al. 2015). Another open-source solver that allows for McCormick relaxations, EAGO, has been released by Wilhelm and Stuber (2020).

The optimization problems in this work are implemented using the open-source modeling toolbox MeLOn (Schweidtmann et al. 2020b), which includes our implementation of the one-class SVM, and solved by MAiNGO (Bongartz et al. 2018). For comparison, the problems are additionally exported to GAMS and solved by the commercial solver BARON (Tawarmalani and Sahinidis 2005).

Tight convex and concave relaxations are highly desirable in global optimization. Therefore, we use the tightest possible relaxations, i.e., the envelopes, of the radial basis function kernel in our solver MAiNGO. Note that we derived these envelopes in our previous work (Schweidtmann et al. 2020a) as the squared exponential covariance function in Gaussian processes is equivalent to the radial basis function kernel in SVMs.

3 Illustrative case studies

As high-dimensional problems are difficult to visualize, we first consider eight two-dimensional data sets for the illustration of the proposed method. Afterwards, we consider an engineering case study in Sect. 4. As shown in Fig. 5, the illustrative examples cover a variety of relevant scenarios. All data points are randomly generated within pre-specified bounds and perturbed by noise. Thus, the data sets do not exhibit sharp boundaries; rather, they also include noisy outlier data points.

In the next step, we evaluate an adapted peaks function, \(f_{\text {Peaks}} : \mathbb {R}^2 \mapsto \mathbb {R}\), on all data sets with \(f_{\text {Peaks}}(x_1,x_2) = 3(1-x_1)^2 \cdot \exp {(-x_1^2-(x_2+1)^2)} -10(\frac{x_1}{5}-x_1^3-x_2^5)\cdot \exp {(-x_1^2 -x_2^2)} - \frac{1}{3}\cdot \exp {(-(x_1+1)^2-x_2^2)}-1.3x_2\). The peaks function is a standard test function in Matlab. We adapted the function slightly by adding a linear term that avoids a flat response outside the sampled domain. Then, we train individual ANNs on the eight data sets using Keras (Chollet et al. 2015). All ANNs have an input layer with two neurons, two hidden layers with six and eight neurons, respectively, and an output layer with one neuron. The hidden layers use \(\tanh \) activation and the output layers use linear activation. For training, the inputs are scaled onto \([-1,1]\) and the outputs are scaled to zero mean and unit variance. We further use a batch size of 128 and an epoch limit of 4000. Note that we omit a hyperparameter study for the ANNs because ANN training is not the focus of this work.
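
For reference, a sketch of the adapted peaks function and of the described network architecture is given below; the optimizer and the loss function are not specified above and are assumptions here.

```python
import numpy as np
from tensorflow import keras

def f_peaks(x1, x2):
    """Adapted Matlab peaks function with the additional linear term -1.3*x2."""
    return (3.0 * (1.0 - x1) ** 2 * np.exp(-x1 ** 2 - (x2 + 1.0) ** 2)
            - 10.0 * (x1 / 5.0 - x1 ** 3 - x2 ** 5) * np.exp(-x1 ** 2 - x2 ** 2)
            - (1.0 / 3.0) * np.exp(-((x1 + 1.0) ** 2) - x2 ** 2)
            - 1.3 * x2)

# Architecture as described: 2 inputs, hidden layers with 6 and 8 tanh
# neurons, one linear output neuron.
model = keras.Sequential([
    keras.Input(shape=(2,)),
    keras.layers.Dense(6, activation="tanh"),
    keras.layers.Dense(8, activation="tanh"),
    keras.layers.Dense(1, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")  # assumed; not stated in the text
# model.fit(x_scaled, y_scaled, batch_size=128, epochs=4000, verbose=0)
```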

All optimization problems are solved on one core of an Intel Xeon CPU E5-2630 v3 (2.40GHz) with 192 GB RAM and the Windows Server 2016 operating system. We use a relative and absolute optimality tolerance of 0.001, a CPU time limit of 1000 s, and default settings in BARON and MAiNGO.

3.1 Topological data analysis

The persistent diagrams of the eight input data sets are shown in Fig. 4. Figures 4d and 4e each show an H\(_1\) component with a long lifespan. These correspond to the holes in the respective data sets “box w/ hole” and “circle w/ hole”. Moreover, the corresponding \(\epsilon _{\text {death}}\) values provide information about the diameters of the holes. Recall that H\(_1\) components that are close to the diagonal have a short lifespan and are therefore not relevant for this analysis.

The persistent diagrams also show the existence of disjoint clusters in the data sets “two circles” and “two ovals”. In Figs. 4f and 4g, the H\(_0\) components with a high \(\epsilon _{\text {death}}\) indicate disjoint clusters and are a measure for the distance between them. Note that the \(H_0\) components at the infinity line persist as \(\epsilon \) goes to infinity and do not die. They correspond to the \(H_0\) components that include all data points.

The persistent diagrams show no distinct differences between the “box”, “oval”, “box2”, and “banana” case studies. This illustrates that the method cannot distinguish between convex and nonconvex data sets in general. Therefore, the persistent diagrams cannot ensure that the convex hull is sufficient to describe the validity domain. Rather, it can only identify some cases where the convex hull is not sufficient.

Fig. 4: Comparison of the persistent homology plots of the eight illustrative data sets

3.2 Validity domain modeling

In order to compare the one-class SVM and the convex hull, we model the input data of the eight case studies with both methods. We employ the common radial basis function kernel for the one-class SVM. We set the hyperparameter \(\nu \) to a low value of 0.03 because the data contain only a few noisy outliers. The hyperparameter \(\gamma \) is identified using the incremental approach described in Sect. 2.2. The selected \(\gamma \) values are summarized in Table 1.

Table 1 The values of the hyperparameter \(\gamma \) for the eight case studies. The hyperparameters are selected based on the incremental approach described in Sect. 2.2

The learned boundaries for the case studies are depicted in Fig. 5. As expected, the convex hull does not model holes and disjoint data clusters. Instead, the convex hull overestimates the validity domain of the case studies. In contrast, the one-class SVM is able to model holes and disjoint data clusters. Furthermore, the convex hull includes all data points, whereas the one-class SVM also allows for outliers in the data and excludes regions with only low data density from the validity domain (e.g., see the “box” case study). Notably, the one-class SVM is more stringent in these cases compared to methods that rely on the distance to the closest training data points (Teixeira et al. 2006; Rall et al. 2019).

Fig. 5: Comparison of the convex hull and the boundaries learned by the one-class SVMs

3.3 Optimization results

We minimize the prediction of the eight trained ANNs subject to the convex hull or the one-class SVM as constraints. Table 2 shows the optimal solution points, \(\mathbf{x} ^*\), and objective function values, \(f_{\text {ANN}}(\mathbf{x} ^*)\), for the problem with convex hull and one-class SVM as constraints.

Table 2 The table compares the globally optimal solutions with the convex hull and the one-class SVM as constraints. The optimal solution point is given by \(x^*\) and the objective function value by \(f_{\text {ANN}}^*\). Also, we provide the error of the data-driven model at the optimal solution, \(\varDelta =\vert f_{\text {ANN}}^*-f_{\text {Peaks}}(x^*)\vert \). The reference solution point shows the optimal solution when considering the underlying function \(f_{\text {Peaks}}\) as the objective with the one-class SVM constraint

The optimal objective function values of the convex hull approach are lower than those with the one-class SVM for all case studies because the convex hull overestimates the validity domain. This overestimation can lead to large errors at the optimal solution points. For the “banana” case study, the optimal solution found by the convex hull approach is outside the validity domain but at the boundary of the convex hull (see Table 2). This leads to a wrongly estimated objective value of \(-5.2\) by the ANN with an absolute error of 8.52. In contrast, the one-class SVM models the validity domain accurately and yields an optimal solution of \(-4.3\) with an absolute error of 0.11. Also, the solution point found by the ANN model with the one-class SVM constraint is close to the reference solution where the underlying peaks function is optimized subject to the SVM constraint. Similarly, the one-class SVM leads to more reliable results for the data sets “two circles”, “box w/ holes”, and “circle w/ holes”. Interestingly, the convex hull approach also leads to a substantial prediction error in the “box” case study, while the one-class SVM models the validity domain accurately. This highlights the risk of using the convex hull approach in the presence of noise. Note that methods which rely on the distance to the closest training data points face similar issues.

In Table 3, we provide the CPU times for optimization with one-class SVMs embedded. Using the FS formulation, BARON and MAiNGO perform similarly and solve most problems in a few hundred CPU seconds. The RS formulation outperforms the FS formulation on all problem instances. In BARON, the speedup factor between the RS and the FS formulation ranges from 5 to over 14. In comparison, the speedup factor between the RS and FS in MAiNGO ranges from 583 to over 3226. This is in agreement with our previous studies, where the McCormick relaxations in the RS lead to smaller subproblems compared to the auxiliary variable method (see Sect. 2.3).

Table 3 CPU times for optimization of the eight case studies with the one-class SVM as a constraint. The table compares the FS and RS formulations for the BARON and MAiNGO solvers. Here, the data-driven model (i.e., the ANN) and the one-class SVM are formulated in the RS and FS. The speed up factor (sp-f) is given as the ratio between the FS and the RS solution times. Also, the number of support vectors (# Sup. vec.) is shown as a measure for the problem complexity

In Table 4, we compare the computational performance for optimization with the convex hull embedded. The CPU times with the convex hull are lower compared to those with one-class SVMs. MAiNGO is substantially faster than BARON when formulating the problem in the FS. On average, BARON requires about 83 s to solve the problem in the FS while MAiNGO requires only 3 s. The RS formulation again outperforms the FS formulation for all problems. However, in this case, the speedup factors are of the same order of magnitude for BARON and MAiNGO, ranging between 13 and 54. It should be noted that the RS and the FS formulations of the convex hull constraints are identical. Therefore, the difference is only due to the formulation of the data-driven model in the objective function, i.e., the ANN (c.f. Schweidtmann and Mitsos 2019).

Table 4 CPU times for optimization of the eight case studies with the convex hull as a constraint. The table compares the FS and RS formulations for the BARON and MAiNGO solvers. The speed up factor (sp-f) is given as the ratio between the FS and the RS solution times. Also, the number of support vectors (# Sup. vec.) is shown as a measure for the problem complexity

4 Engineering application

We consider a sulfur recovery unit as a relevant engineering case study for our work because a large set of industrial operating data is available online for this process. The efficient recovery of sulfur from tail gas in petroleum refineries is important for environmental reasons. The sulfur recovery unit process is illustrated in Fig. 6. The process has two acid gases as inputs: the MEA gas stream is rich in hydrogen sulphide (H\(_2\)S) and comes from the gas washing plants. The SWS gas stream is rich in H\(_2\)S and ammonia and comes from a sour water stripping plant. In the sulfur recovery unit, the acid gases are burnt via partial reaction with air in a two-chamber reaction furnace. Then, the combustion product is further treated in two subsequent catalytic reactors, resulting in a tail gas stream that contains residuals of H\(_2\)S and sulfur dioxide (SO\(_2\)). A detailed process description can be found in the literature (Fortuna et al. 2007).

Fig. 6: Flowchart of the sulfur recovery unit process

A key issue of the sulfur recovery unit is the control of the secondary air flow to ensure optimal conditions for the total removal of the sulfur compounds in the catalytic converters. Previous works have investigated soft sensors for the tail gas concentrations of H\(_2\)S and SO\(_2\) using ANNs and implemented those in industry for monitoring (Quek et al. 2000; Fortuna et al. 2003, 2007). In this case study, we solve an open-loop control problem to find the optimal secondary air flow rate. The objective is to minimize \(\vert c_{H_2S}-2\cdot c_{SO_2} \vert \) such that the two reactants are in stoichiometric proportion. Similar to the previous literature by Fortuna et al. (2003, 2007), we also train two ANNs to predict the concentrations:

$$\begin{aligned} c_{H_2S, k}&= f_{{ANN, H_2S}} \left( x_{1,k}, x_{1,k-5}, x_{1,k-7}, x_{1,k-9}, \ldots , x_{5,k}, x_{5,k-5}, x_{5,k-7}, x_{5,k-9} \right) , \\ c_{SO_2, k}&= f_{{ANN, SO_2}} \left( x_{1,k}, x_{1,k-5}, x_{1,k-7}, x_{1,k-9}, \ldots , x_{5,k}, x_{5,k-5}, x_{5,k-7}, x_{5,k-9} \right) , \end{aligned}$$

where \(x_{1,k}\) is the gas flow in the MEA zone, \(x_{2,k}\) is the air flow in the MEA zone, \(x_{3,k}\) is the secondary air flow in the MEA zone, \(x_{4,k}\) is the air flow in the SWS zone, and \(x_{5,k}\) is the gas flow in the SWS zone at time step k. The \(SO_2\) ANN has one hidden layer with eight neurons and the \(H_2S\) ANN has two hidden layers with eight neurons each. The data-driven models are trained on (scaled) industrial data collected at a plant located in Priolo, Italy, available at https://www.openml.org/d/23515. The data set includes a time series with approximately 10,000 data samples and we use the first 90% of the data for training. The control variable of the nonlinear model predictive control (NMPC) problem is the secondary air flow \(x_{3,k}\), while the other inputs are observable parameters. As the control is critical for process safety, the validity limits of the data-driven model should be considered.
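
Structurally, the resulting one-step open-loop problem can be sketched as follows; the only degree of freedom is \(x_{3,k}\), all other lagged inputs are fixed to the measured operating point, and all function names are illustrative placeholders rather than the MeLOn/MAiNGO interface.

```python
import numpy as np

# Stand-ins for the two trained ANNs; these placeholders only mimic the
# call signature of the real (MeLOn) models.
def ann_h2s(u):
    return float(np.tanh(u).sum())

def ann_so2(u):
    return float(0.5 * np.tanh(u).sum())

def objective(x3_k, u_fixed, idx_x3):
    """|c_H2S - 2*c_SO2| for a candidate secondary air flow x3_k.

    u_fixed: 20-dimensional lagged input vector at the measured operating
    point; idx_x3: (assumed) position of x_{3,k} within that vector."""
    u = u_fixed.copy()
    u[idx_x3] = x3_k
    return abs(ann_h2s(u) - 2.0 * ann_so2(u))

# The optimization then minimizes objective(...) over x3_k, subject to the
# one-class SVM validity constraint f_DF-D(u) >= 0 from Sect. 2.2.
```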

In order to analyze the topology of the 20-dimensional input training data set of the ANNs, we apply persistent homology. Due to the large number of data points, the exact computation of the persistent diagram is expensive. Instead, we apply an approximate sparse filtration (Cavanna et al. 2015). The persistent diagram for this case study is shown in Fig. 7. The diagram shows that there exist a number of holes in the data set that persist over a long time span. Also, a separate cluster can be observed in the data. This motivates the use of the one-class SVM to obey the validity limits of the data-driven models.
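
In Ripser.py, such an approximation is available via greedy subsampling of landmark points; to our understanding, the n_perm argument follows the sparse-filtration idea of Cavanna et al. (2015). A sketch with placeholder data (the landmark count is an arbitrary choice):

```python
import numpy as np
from ripser import ripser

# Placeholder for the 20-dimensional lagged input data (ca. 10,000 samples).
X_train = np.random.default_rng(3).standard_normal((10000, 20))

# Approximate persistent homology via greedy subsampling to 500 landmark
# points (n_perm); the landmark count here is an arbitrary choice.
dgms = ripser(X_train, maxdim=1, n_perm=500)["dgms"]
```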

Fig. 7: Persistent diagram of the training data of the engineering case study presented in Sect. 4

As we have no physical model of the process available, the closed-loop performance of the controller is not studied in this example. For illustration, we perform one step of an open-loop controller for the secondary air flow \(x_{3,k}\). We select a random operating point from the historic plant data (Table 5) and let the solver determine the optimal control action. The problem is solved to global optimality within 0.33 CPU seconds, and the solver identifies a control action \(x_{3,k}=0.266\) that results in the desired stoichiometric composition, i.e., \(\vert c_{H_2S}-2\cdot c_{SO_2} \vert =1.3\cdot 10^{-5}\). This engineering case study also demonstrates the potential of the proposed method for NMPC. Note that deterministic global NMPC can become computationally expensive for long control horizons and higher-dimensional control vectors (Chachuat et al. 2006; Doncevic et al. 2020; Kappatou et al. 2020).

Table 5 Operating point of the sulfur recovery unit that is considered for the NMPC optimization step

5 Conclusion

Safety concerns and extrapolation issues often impede industrial applications of machine learning models. We present a three-step approach to obey the validity limits of data-driven models. First, we perform a data topology analysis using persistent homology. Second, we model the validity domain of the data-driven model using either the convex hull or a one-class SVM. Third, we perform deterministic global optimization with the validity domain model as a constraint.

All used and developed methods are available open-source. Also, we are currently developing a Python interface for our solver MAiNGO. Thus, all methods can be applied and further developed in academia and industry for free.

Our method has the potential to enhance the safety, trust, and reliability of machine learning approaches. Moreover, we demonstrate that persistent homology is a valuable method for understanding the topology of data in high-dimensional spaces. Besides industrial applications, promising future work also includes the application to optimization problems occurring in molecular design, where molecules are parameterized through graph neural networks (Schweidtmann et al. 2020c) or autoencoders (Jin et al. 2018). Also, time-dependent design space descriptions are desired in pharmaceutics (von Stosch et al. 2020). The proposed method can also be extended by considering and comparing other one-class classification methods. Finally, the extension of the presented method to adaptive space exploration would be promising future work.