Reinforcement learning based optimal control of batch processes using Monte-Carlo deep deterministic policy gradient with phase segmentation

https://doi.org/10.1016/j.compchemeng.2020.107133

Highlights

  • A reward function design is suggested for general economic process control.

  • A phase segmentation approach is proposed to address the distinct characteristics of the different phases of a batch run.

  • The DDPG algorithm is modified with Monte-Carlo learning for stable agent training.

  • The suggested algorithm is applied to a batch polymerization process control problem.

Abstract

Batch process control represents a challenge given its dynamic operation over a large operating envelope. Nonlinear model predictive control (NMPC) is the current standard for optimal control of batch processes. The performance of conventional NMPC can be unsatisfactory in the presence of uncertainties. Reinforcement learning (RL) which can utilize simulation or real operation data is a viable alternative for such problems. To apply RL to batch process control effectively, however, choices such as the reward function design and value update method must be made carefully. This study proposes a phase segmentation approach for the reward function design and value/policy function representation. In addition, the deep deterministic policy gradient algorithm (DDPG) is modified with Monte-Carlo learning to ensure more stable and efficient learning behavior. A case study of a batch polymerization process producing polyols is used to demonstrate the improvement brought by the proposed approach and to highlight further issues.

Introduction

Batch or semi-batch processing is widely used in the process industry, mostly for producing low-volume, high-value-added products. The operating conditions of a batch process are determined to meet given requirements on end-product quality (e.g., composition, particle size and shape) in a manner that assures safety and economic feasibility (e.g., maximization of productivity or minimization of cost). However, its inherent features, such as 1) non-stationary operation covering a wide operating envelope, 2) consequent exposure of the underlying nonlinear dynamics, and 3) the existence of both path and end-point constraints, present a significant challenge to control engineers. These problems are exacerbated by the often significant variability in feedstock quality and condition as well as other process uncertainties (e.g., disturbances, noise, model errors).

For optimal control of batch processes, nonlinear model predictive control (NMPC) has been the most widely studied method (Qin and Badgwell, 2000; Mayne, 2000; Chang et al., 2016). NMPC can be designed in two different forms. First, it can be designed to follow a prespecified recipe (e.g., setpoint trajectories) determined from an off-line (or run-to-run) optimization. Alternatively, it can be designed to optimize an economic or other performance index directly on-line while respecting process and quality constraints (Rawlings and Amrit, 2009; Ellis et al., 2014). In this setting, as an economic optimization tends to drive the operating recipe towards active constraints, it is important for the controller to assure constraint satisfaction. While NMPC offers the advantage of including constraints directly in the optimal control calculation, its constraint-handling capability can degrade when the model used for prediction has significant uncertainty (Lucia et al., 2014). Conventional NMPC addresses this problem by re-solving the optimization at each sample time after receiving feedback measurements, and also by detuning the controller and tightening the constraints to account for the uncertainty (“back-off”), but this can lead to significant suboptimality (Paulson and Mesbah, 2018; Santos et al., 2019). Robust NMPC formulations based on rigorous uncertainty models (e.g., bounded parameter sets, scenario trees) have also appeared, but they come at the cost of significantly increased complexity, which has limited their practical use (Morari and Lee, 1999; Lucia et al., 2013; Thangavel et al., 2018).
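
To make the back-off idea concrete, here is a generic sketch (not the formulation used in this paper): instead of enforcing the nominal path constraint g(x_k, u_k) ≤ 0 in the on-line optimization, the controller enforces a tightened version

    g(x_k, u_k) \le -b_k, \qquad b_k \ge 0,

where the back-off margin b_k is sized to the anticipated effect of the uncertainty over the prediction horizon. A larger margin improves the probability of constraint satisfaction but pushes the recipe away from the economically optimal active constraints, which is the suboptimality referred to above.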

In this regard, reinforcement learning (RL), which tries to learn an optimal control policy and value function from data, may be worthy of consideration. Learning data can be real operation data or simulation data, and such data contain (or can be made to contain) the effect of the uncertainty that will be experienced on-line. Therefore, RL provides a flexible framework wherein a control policy can be learned to address the uncertainty contained in the learning data, and its potential has been examined through several set-point tracking control problems (Lee and Lee, 2005; Spielberg et al., 2019; Kim et al., 2020). For such problems, the set-point tracking error was used as the (negative) reward in the training of the RL-based controller. On the other hand, only a few studies have examined RL in the context of dynamic optimization of batch process operation (Wilson and Martinez, 1997; Martinez, 1998; Petsagkourakis et al., 2020).

In this paper, we examine the use of RL for on-line dynamic optimization and control of a batch process with high dimensional and continuous state and action spaces. We focus on the issues that arise in such applications, e.g., the choice of reward functions, esp. in handling constraints, construction of time-dependent approximators, and choice of learning algorithm. First of all, we suggest three types of reward functions that combine economic performance index and degree of violation of path and end-point constraints. To address the time dependence of batch operation, a phase segmentation approach is proposed to tailor the reward functions and the function approximators to suit distinct characteristics of the different phases of batch run. For the agent training, deep deterministic policy gradient (DDPG) algorithm (Lillicrap et al., 2016) is adopted as it is known to be effective in handling high dimensional continuous state and action spaces. However, the traditional DDPG algorithm, which uses temporal difference learning, is modified with Monte-Carlo (MC) learning so as to ensure stable learning behavior. This is particularly important for batch process control as the reward function is designed such that violation of the end-point constraints affect the overall reward significantly. The NMPC and different RL control strategies will be examined and compared in a case study that involves a semi-batch polymerization reactor producing polyols (Nie et al., 2013).
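
To illustrate the difference between the two learning targets, here is a minimal sketch (not the authors' implementation; critic_target and actor_target are hypothetical stand-ins for the DDPG target networks). The conventional TD-DDPG critic is regressed onto a bootstrapped one-step target, whereas the MC variant uses the full discounted return computed after the batch episode terminates, so a terminal end-point-constraint penalty is propagated to every earlier step:

    import numpy as np

    def td_targets(rewards, next_states, dones, critic_target, actor_target, gamma=0.99):
        # Bootstrapped one-step targets: y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})).
        next_q = critic_target(next_states, actor_target(next_states))
        return rewards + gamma * (1.0 - dones) * next_q

    def mc_returns(rewards, gamma=0.99):
        # Full discounted returns, computed backwards once the episode has finished,
        # so a large terminal reward or penalty reaches every earlier time step directly.
        returns = np.zeros(len(rewards))
        g = 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            returns[t] = g
        return returns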

In summary, the intended contributions of this paper are 1) suggesting an RL method for optimal control of batch processes with uncertainty, 2) proposing the phase segmentation approach for designing the reward functions and the function approximators, and 3) modifying the DDPG algorithm with MC learning so that the agent better reflects the future value, particularly with respect to satisfaction of the end-point constraints.

In the next section we briefly introduce MPC, RL, MC vs. TD (temporal-difference) learning, and DDPG. Section 3 then proposes an RL-based optimal control strategy with specific components such as the phase segmentation and MC-DDPG learning. The case study problem is introduced next, and the training environment of the RL-based optimal controller is described in Section 4. In Section 5, the improvement gained by the phase segmentation approach is demonstrated, and the learning results of the conventional DDPG and MC-DDPG algorithms as well as NMPC are compared. In addition, a sensitivity analysis with respect to the choice of the phase segmentation point and the hyperparameters used in the reward design is presented. Section 6 summarizes and concludes the paper.

Section snippets

MPC and RL

The optimal control problem formulation we adopt in this study is given in Eq. (2.1). A general way to solve this problem is via the Hamilton-Jacobi-Bellman (HJB) equation, which can be derived by applying Bellman's principle of optimality. However, the HJB equation is a PDE with boundary conditions and cannot be solved in most cases, with just a few well-known exceptions, e.g., the linear quadratic optimal control problem. MPC and RL represent two different approaches to
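
Since Eq. (2.1) itself is not reproduced in this preview, the following is a generic finite-horizon statement of the problem and the associated HJB equation in standard textbook form (not necessarily the exact formulation of the paper):

    \min_{u(\cdot)} \; J = \phi(x(t_f)) + \int_0^{t_f} L(x(t), u(t)) \, dt
    \quad \text{s.t.} \quad \dot{x} = f(x, u), \; x(0) = x_0,

    -\frac{\partial V}{\partial t} = \min_u \left[ L(x, u) + \left( \frac{\partial V}{\partial x} \right)^{\top} f(x, u) \right],
    \qquad V(x, t_f) = \phi(x).

Here V(x, t) is the optimal value function; MPC sidesteps the PDE by repeatedly solving the finite-horizon problem on-line, while RL approximates the value function (or the associated Q-function) from data.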

Reward design

The goal of batch process control is to produce target products in an economical and safe manner. As in classical optimal control, goal engineering is needed to express the goal as the reward function (i.e., the objective function). In this regard, three types of reward terms (as shown in Table 1) are suggested to allow the agent to be trained effectively for the purposes of achieving high economic performance and satisfying the relevant path and end-point constraints. The first term (r_path)
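
As a rough illustration of how such reward terms can be encoded (a minimal sketch assuming a soft-penalty form; the exact expressions are given in Table 1 of the full text, and the function names and weights below are placeholders, not the paper's definitions):

    def path_reward(g_path, w_path=1.0):
        # Penalize path-constraint violation at every time step; g_path <= 0 means feasible.
        return -w_path * max(0.0, g_path)

    def endpoint_reward(g_end, w_end=10.0):
        # Penalize end-point (product quality) constraint violation, granted only at batch end.
        return -w_end * max(0.0, g_end)

    def economic_reward(econ_index, w_econ=1.0):
        # Reward the economic performance index, e.g. productivity net of operating cost.
        return w_econ * econ_index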

Propylene oxide (PO) batch polymerization

To evaluate and illustrate the performance of the proposed MC-DDPG with phase segmentation strategy for batch process control, a polyether polyol process for polypropylene glycol production is chosen as a case study. This process exhibits significant nonlinear dynamics and involves both path and end-point constraints. The monomer PO first reacts with the alkaline anion, and the oxy-propylene anion then undergoes propagation, followed by the cation-exchange and proton-transfer reactions. A

Results & discussion

To demonstrate the benefit of using the suggested reward design strategy with phase segmentation and the MC-DDPG algorithm, we compare 1) the results of the agent trained with and without phase segmentation, 2) the learning performance of the agents using the MC-DDPG and the conventional DDPG (TD-DDPG), and 3) the control performance of the agent trained with the MC-DDPG algorithm, the agent initialized by supervised learning on the NMPC closed-loop data (i.e., imitation learning),

Conclusion

An RL based batch process control strategy was proposed, with particular attention given to the design of the reward to properly reflect the process’s economic performance and (path/end-point) constraint satisfaction. In addition, the phase segmentation approach and the Monte-Carlo DDPG algorithm were suggested to handle the non-stationary and irreversible characteristics of most batch processes. The suggested RL strategy was tested on a batch polymerization example and the beneficial effects

CRediT authorship contribution statement

Haeun Yoo: Conceptualization, Methodology, Software, Writing - original draft. Boeun Kim: Validation, Writing - review & editing. Jong Woo Kim: Resources, Writing - review & editing. Jay H. Lee: Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare no conflict of interest.

Acknowledgement

This research was supported by Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0008475, The Competency Development Program for Industry Specialist).

References (45)

  • S. Lucia et al., Handling uncertainty in economic nonlinear model predictive control: a comparative case study, J. Process Control (2014)
  • S. Lucia et al., Multi-stage nonlinear model predictive control applied to a semi-batch polymerization reactor under uncertainty, J. Process Control (2013)
  • E.C. Martinez, Learning to control the performance of batch processes, Chem. Eng. Res. Des. (1998)
  • E. Mastan et al., Method of moments: a versatile tool for deterministic modeling of polymerization kinetics, Eur. Polym. J. (2015)
  • M. Morari et al., Model predictive control: past, present and future, Comput. Chem. Eng. (1999)
  • J.A. Paulson et al., Nonlinear model predictive control with explicit backoffs for stochastic systems under arbitrary uncertainty, IFAC-PapersOnLine (2018)
  • P. Petsagkourakis et al., Reinforcement learning for batch bioprocess optimization, Comput. Chem. Eng. (2020)
  • S. Thangavel et al., Dual robust nonlinear model predictive control: a multi-stage approach, J. Process Control (2018)
  • P.I. Barton et al., Dynamic optimization in a discontinuous world, Ind. Eng. Chem. Res. (1998)
  • D. Bonvin et al., Dynamic Optimization in the Batch Chemical Industry, Technical Report (2001)
  • S. Fujimoto et al., Addressing function approximation error in actor-critic methods, 35th International Conference on Machine Learning, ICML 2018 (2018)
  • W.E. Hart et al., Pyomo–optimization modeling in Python (2017)