Reinforcement learning based optimal control of batch processes using Monte-Carlo deep deterministic policy gradient with phase segmentation

https://doi.org/10.1016/j.compchemeng.2020.107133

Highlights

  • A reward function design is suggested for general economic process control.

  • A phase segmentation approach is proposed to address the distinct characteristics of the different phases of a batch run.

  • The DDPG algorithm is modified with Monte-Carlo learning for stable agent training.

  • The suggested algorithm is applied to a batch polymerization process control problem.

Abstract

Batch process control represents a challenge given its dynamic operation over a large operating envelope. Nonlinear model predictive control (NMPC) is the current standard for optimal control of batch processes. The performance of conventional NMPC can be unsatisfactory in the presence of uncertainties. Reinforcement learning (RL) which can utilize simulation or real operation data is a viable alternative for such problems. To apply RL to batch process control effectively, however, choices such as the reward function design and value update method must be made carefully. This study proposes a phase segmentation approach for the reward function design and value/policy function representation. In addition, the deep deterministic policy gradient algorithm (DDPG) is modified with Monte-Carlo learning to ensure more stable and efficient learning behavior. A case study of a batch polymerization process producing polyols is used to demonstrate the improvement brought by the proposed approach and to highlight further issues.

Introduction

Batch or semi-batch processing is widely used in the process industry, mostly for producing low-volume, high-value-added products. The operating conditions of a batch process are determined to meet given requirements on end-product quality (e.g., composition, particle size and shape) in a manner that assures safety and economic feasibility (e.g., maximization of productivity or minimization of cost). However, its inherent features, such as 1) non-stationary operation covering a wide operating envelope, 2) consequent exposure of the underlying nonlinear dynamics, and 3) the existence of both path and end-point constraints, present a significant challenge to control engineers. These problems are exacerbated by the often significant variability in feedstock quality and condition as well as other process uncertainties (e.g., disturbances, noise, model errors).

For optimal control of batch processes, nonlinear model predictive control (NMPC) has been the most widely studied method (Qin and Badgwell, 2000; Mayne, 2000; Chang et al., 2016). NMPC can be designed in two different forms. First, it can be designed to follow a prespecified recipe (e.g., setpoint trajectories) determined from an off-line (or run-to-run) optimization. Alternatively, it can be designed to optimize an economic or other performance index directly on-line while respecting process and quality constraints (Rawlings and Amrit, 2009; Ellis et al., 2014). In this setting, as an economic optimization tends to drive the operating recipe towards active constraints, it is important for the controller to assure constraint satisfaction. While NMPC offers the advantage of including constraints directly in the optimal control calculation, its constraint-handling capability can degrade when the model used for prediction has significant uncertainty (Lucia et al., 2014). Conventional NMPC addresses this problem by re-solving the optimization at each sample time after receiving feedback measurements, and also by detuning the controller and tightening the constraints to account for the uncertainty (“back-off”), but this can lead to significant suboptimality (Paulson and Mesbah, 2018; Santos et al., 2019). Robust NMPC formulations based on rigorous uncertainty models (e.g., bounded parameter sets, scenario trees) have also appeared, but they come at the cost of significantly increased complexity, which has limited their practical use (Morari and Lee, 1999; Lucia et al., 2013; Thangavel et al., 2018).
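
To make the back-off idea concrete, here is a generic sketch (not the formulation used in this paper): instead of enforcing the nominal path constraint g(x_k, u_k) ≤ 0 in the on-line optimization, the controller enforces a tightened version

    g(x_k, u_k) \le -b_k, \qquad b_k \ge 0,

where the back-off margin b_k is sized to the anticipated effect of the uncertainty over the prediction horizon. A larger margin improves the probability of constraint satisfaction but pushes the recipe away from the economically optimal active constraints, which is the suboptimality referred to above.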

In this regard, reinforcement learning (RL), which tries to learn an optimal control policy and value function from data, may be worthy of consideration. Learning data can be real operation data or simulation data, and such data contain (or can be made to contain) the effect of the uncertainty that will be experienced on-line. Therefore, RL provides a flexible framework wherein a control policy can be learned to address the uncertainty contained in the learning data, and its potential has been examined through several set-point tracking control problems (Lee and Lee, 2005; Spielberg et al., 2019; Kim et al., 2020). For such problems, the set-point tracking error was used as the (negative) reward in the training of the RL-based controller. On the other hand, only a few studies have examined RL in the context of dynamic optimization of batch process operation (Wilson and Martinez, 1997; Martinez, 1998; Petsagkourakis et al., 2020).

In this paper, we examine the use of RL for on-line dynamic optimization and control of a batch process with high dimensional and continuous state and action spaces. We focus on the issues that arise in such applications, e.g., the choice of reward functions, esp. in handling constraints, construction of time-dependent approximators, and choice of learning algorithm. First of all, we suggest three types of reward functions that combine economic performance index and degree of violation of path and end-point constraints. To address the time dependence of batch operation, a phase segmentation approach is proposed to tailor the reward functions and the function approximators to suit distinct characteristics of the different phases of batch run. For the agent training, deep deterministic policy gradient (DDPG) algorithm (Lillicrap et al., 2016) is adopted as it is known to be effective in handling high dimensional continuous state and action spaces. However, the traditional DDPG algorithm, which uses temporal difference learning, is modified with Monte-Carlo (MC) learning so as to ensure stable learning behavior. This is particularly important for batch process control as the reward function is designed such that violation of the end-point constraints affect the overall reward significantly. The NMPC and different RL control strategies will be examined and compared in a case study that involves a semi-batch polymerization reactor producing polyols (Nie et al., 2013).
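
To illustrate the difference between the two learning targets, here is a minimal sketch (not the authors' implementation; critic_target and actor_target are hypothetical stand-ins for the DDPG target networks). The conventional TD-DDPG critic is regressed onto a bootstrapped one-step target, whereas the MC variant uses the full discounted return computed after the batch episode terminates, so a terminal end-point-constraint penalty is propagated to every earlier step:

    import numpy as np

    def td_targets(rewards, next_states, dones, critic_target, actor_target, gamma=0.99):
        # Bootstrapped one-step targets: y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})).
        next_q = critic_target(next_states, actor_target(next_states))
        return rewards + gamma * (1.0 - dones) * next_q

    def mc_returns(rewards, gamma=0.99):
        # Full discounted returns, computed backwards once the episode has finished,
        # so a large terminal reward or penalty reaches every earlier time step directly.
        returns = np.zeros(len(rewards))
        g = 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            returns[t] = g
        return returns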

In summary, the intended contributions of this paper are 1) suggesting an RL method for optimal control of batch processes with uncertainty, 2) proposing the phase segmentation approach for designing the reward functions and the function approximators, and 3) modifying the DDPG algorithm with MC learning so that the agent better reflects the future value, particularly with respect to satisfaction of the end-point constraints.

In the next section we briefly introduce MPC, RL, MC vs. TD (temporal-difference) learning, and DDPG. Section 3 then proposes an RL-based optimal control strategy with specific components such as the phase segmentation and MC-DDPG learning. The case study problem is introduced next, and the training environment of the RL-based optimal controller is described in Section 4. In Section 5, the improvement gained by the phase segmentation approach is demonstrated, and the learning results of the conventional DDPG and MC-DDPG algorithms as well as NMPC are compared. In addition, a sensitivity analysis with respect to the choice of the phase segmentation point and the hyperparameters used in the reward design is presented. Section 6 summarizes and concludes the paper.

Section snippets

MPC and RL

The optimal control problem formulation we adopt in this study is given in Eq. (2.1). A general way to solve this problem is via the Hamilton-Jacobi-Bellman (HJB) equation, which can be derived by applying Bellman's principle of optimality. However, the HJB equation is a PDE with boundary conditions and cannot be solved in most cases, with just a few well-known exceptions, e.g., the linear quadratic optimal control problem. MPC and RL represent two different approaches to
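
Since Eq. (2.1) itself is not reproduced in this preview, the following is a generic finite-horizon statement of the problem and the associated HJB equation in standard textbook form (not necessarily the exact formulation of the paper):

    \min_{u(\cdot)} \; J = \phi(x(t_f)) + \int_0^{t_f} L(x(t), u(t)) \, dt
    \quad \text{s.t.} \quad \dot{x} = f(x, u), \; x(0) = x_0,

    -\frac{\partial V}{\partial t} = \min_u \left[ L(x, u) + \left( \frac{\partial V}{\partial x} \right)^{\top} f(x, u) \right],
    \qquad V(x, t_f) = \phi(x).

Here V(x, t) is the optimal value function; MPC sidesteps the PDE by repeatedly solving the finite-horizon problem on-line, while RL approximates the value function (or the associated Q-function) from data.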

Reward design

The goal of batch process control is to produce target products in an economical and safe manner. As in classical optimal control, goal engineering is needed to express the goal as the reward function (i.e., the objective function). In this regard, three types of reward terms (as shown in Table 1) are suggested to allow the agent to be trained effectively for the purposes of achieving high economic performance and satisfying the relevant path and end-point constraints. The first term (r_path)
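
As a rough illustration of how such reward terms can be encoded (a minimal sketch assuming a soft-penalty form; the exact expressions are given in Table 1 of the full text, and the function names and weights below are placeholders, not the paper's definitions):

    def path_reward(g_path, w_path=1.0):
        # Penalize path-constraint violation at every time step; g_path <= 0 means feasible.
        return -w_path * max(0.0, g_path)

    def endpoint_reward(g_end, w_end=10.0):
        # Penalize end-point (product quality) constraint violation, granted only at batch end.
        return -w_end * max(0.0, g_end)

    def economic_reward(econ_index, w_econ=1.0):
        # Reward the economic performance index, e.g. productivity net of operating cost.
        return w_econ * econ_index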

Propylene oxide (PO) batch polymerization

To evaluate and illustrate the performance of the proposed MC-DDPG with phase segmentation strategy for batch process control, a polyether polyol process for polypropylene glycol production is chosen as a case study. This process exhibits significant nonlinear dynamics and involves both path and end-point constraints. The monomer PO first reacts with the alkaline anion, and the oxy-propylene anion then undergoes propagation, followed by the cation-exchange and proton-transfer reactions. A

Results & discussion

To demonstrate the benefit of using the suggested reward design strategy with phase segmentation and the MC-DDPG algorithm, we compare 1) the results of the agent trained with and without phase segmentation, 2) the learning performance of the agents using the MC-DDPG and the conventional DDPG (TD-DDPG), and 3) the control performance of the agent trained with the MC-DDPG algorithm, the agent initialized by supervised learning on the NMPC closed-loop data (i.e., imitation learning),

Conclusion

An RL based batch process control strategy was proposed, with particular attention given to the design of the reward to properly reflect the process’s economic performance and (path/end-point) constraint satisfaction. In addition, the phase segmentation approach and the Monte-Carlo DDPG algorithm were suggested to handle the non-stationary and irreversible characteristics of most batch processes. The suggested RL strategy was tested on a batch polymerization example and the beneficial effects

CRediT authorship contribution statement

Haeun Yoo: Conceptualization, Methodology, Software, Writing - original draft. Boeun Kim: Validation, Writing - review & editing. Jong Woo Kim: Resources, Writing - review & editing. Jay H. Lee: Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare no conflict of interest.

Acknowledgement

This research was supported by Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0008475, The Competency Development Program for Industry Specialist).

References (45)

  • S. Lucia et al., Handling uncertainty in economic nonlinear model predictive control: a comparative case study, J. Process Control (2014)
  • S. Lucia et al., Multi-stage nonlinear model predictive control applied to a semi-batch polymerization reactor under uncertainty, J. Process Control (2013)
  • E.C. Martinez, Learning to control the performance of batch processes, Chem. Eng. Res. Des. (1998)
  • E. Mastan et al., Method of moments: a versatile tool for deterministic modeling of polymerization kinetics, Eur. Polym. J. (2015)
  • M. Morari et al., Model predictive control: past, present and future, Comput. Chem. Eng. (1999)
  • J.A. Paulson et al., Nonlinear model predictive control with explicit backoffs for stochastic systems under arbitrary uncertainty, IFAC-PapersOnLine (2018)
  • P. Petsagkourakis et al., Reinforcement learning for batch bioprocess optimization, Comput. Chem. Eng. (2020)
  • S. Thangavel et al., Dual robust nonlinear model predictive control: a multi-stage approach, J. Process Control (2018)
  • P.I. Barton et al., Dynamic optimization in a discontinuous world, Ind. Eng. Chem. Res. (1998)
  • D. Bonvin et al., Dynamic Optimization in the Batch Chemical Industry, Technical Report (2001)
  • S. Fujimoto et al., Addressing function approximation error in actor-critic methods, 35th International Conference on Machine Learning, ICML 2018 (2018)
  • W.E. Hart et al., Pyomo–optimization modeling in Python (2017)