Introduction

The production scheduling problem of semiconductor manufacturing systems (SMSs) is one of the most complicated scheduling optimization problems in the literature due to its inherent characteristics, such as re-entrant flow, product variety, a large number of processing steps and machines, and the capital- and technology-intensive nature of the industry [31]. There are three key decision-making phases in the manufacturing process of an SMS: planning, releasing, and production scheduling. First, in the planning phase, enterprises make production plans according to market demands and assign the plans to different workshops for a fixed period [15]. The production plans are generally long-term plans (e.g., annual or monthly). Each production plan consists of several tasks that include the product type, quantity, due date, and other information. For the workshop, these tasks are orders. In the releasing phase, when the workshop receives an order, jobs are formally released into the SMS according to a specific release strategy, which determines the release quantity, release speed, and release job type. Controlling the lot release is important for keeping the work in process (WIP) at a stable level and protecting the throughput from environmental changes [4]. Finally, in the production scheduling phase, the manufacturing system determines the processing sequence of jobs according to the scheduling strategy. Production scheduling is the core of manufacturing system production management.

In real-world SMSs, however, various unexpected disturbances may occur; these can be classified as machine-related (e.g., machine failures), operator-related (e.g., operator errors), and job-related (e.g., uncertain processing times) uncertainties. These uncertainties affect and may even interrupt the production process. On the other hand, with the development of customized and small-batch production in the era of smart manufacturing, enterprises need to rapidly adjust their production plans in response to external demand changes. The SMS is therefore challenged by changing production plans or orders. To improve the rapid responsiveness and adaptability of SMSs to internal uncertainties and external demand changes, it is necessary to co-optimize release control and production scheduling, i.e., the SMS needs to adjust its release control and production scheduling strategies according to the real-time production state to adapt to the changing environment. However, research on integrated dynamic release control and production scheduling for SMSs is still scarce.

In terms of dynamic release control and production scheduling methods, traditional optimization methods include operational research methods (e.g., integer programming), intelligent search algorithms (e.g., genetic algorithms), and heuristic dispatching rules [3, 5, 32]. Operational research methods depend on accurate mathematical modeling, which is difficult for real large-scale, complex production lines. Intelligent search algorithm-based methods are ineffective for dynamic scheduling problems because the problem must be remodeled, re-encoded, and re-searched after every environmental change. Dispatching rule-based methods have been widely applied in SMSs due to their speed and simplicity [5]. However, it is well known that a single rule focuses on only one performance criterion, and no rule outperforms all others across all objectives.

With the development and application of big data, cyber-physical systems (CPS), the Internet of Things (IoT), and other information technologies in manufacturing, it is now possible to interact with the production system in real time and to acquire and store vast amounts of production data [22]. The use of machine learning algorithms has recently attracted increasing interest for solving the dynamic scheduling problems of SMSs by learning from data and selecting the most appropriate scheduling rule [21]. Supervised learning algorithms are used to train a scheduling model from a large number of optimal samples, that is, to learn the mapping between the production state and the optimal or near-optimal scheduling rule. The scheduling model is then applied to the manufacturing system to select the optimal or near-optimal scheduling rule online. Although this kind of methodology can handle large-scale problems and improve the real-time performance and adaptability of the production scheduling system, several limitations remain.

  1. Dynamic adjustment of production scheduling alone cannot fully cope with external demand changes. It is necessary to co-optimize the release control and production scheduling of SMSs, but related research is relatively scarce.

  2. The essence of machine-learning-based dynamic scheduling is selecting one appropriate scheduling rule according to the production status. However, a single rule focuses on only one performance criterion, and no rule outperforms all others across all objectives. The composite dispatching rule (CDR) adopted in this paper combines multiple heuristic dispatching rules by linear weighting and can therefore consider multiple objectives.

  3. The validity of the scheduling model depends to a great extent on the quality of the sample set. In practice, it is difficult to obtain sufficient optimal (i.e., labeled) samples. Most existing works use a traversal simulation-based method to obtain optimal samples: all scheduling rules are first traversed under different states through a simulation model of the manufacturing system to obtain the original data, and the optimal data are then selected to form the final sample set. However, this traversal-based method is not applicable to a CDR whose continuous weight parameters must be decided. Reinforcement learning (RL) does not need labeled samples, and since deep learning has enhanced the large-scale problem-solving ability of RL, scholars have renewed their interest in deep reinforcement learning (DRL) for production scheduling problems in the past two years [20].

This paper therefore introduces DRL to address the integrated dynamic release control and production scheduling problem of SMSs with external urgent orders and internal uncertain processing times, and proposes a convolutional neural network (CNN)- and asynchronous advantage actor-critic (A3C)-based method called CNN-A3C. In this method, the CNN is the scheduling model, with states as inputs and scheduling decisions as outputs. The scheduling decision consists of the parameters of the release strategy and the weight parameters of the CDRs; the decision space is therefore continuous. The A3C algorithm proposed by Mnih et al. is well suited to problems with continuous decision spaces [17]. Thus, in the proposed method, the A3C algorithm is used to train the CNN. The network is then applied to online scheduling to periodically output the most suitable release strategy and CDR parameters according to the real-time production status.

Compared to the existing studies, the main contributions of this paper can be summarized as follows.

  1. This paper studies the integrated dynamic release control and production scheduling problem of SMSs with external urgent orders and internal uncertain processing times, and first applies the combination of DRL and CNN to address it. We propose a CNN-A3C-based release control and production scheduling method, in which the A3C algorithm trains a CNN that serves as the scheduling model, taking states as inputs and producing scheduling decisions as outputs.

  2. To make the A3C algorithm more efficient on the scheduling problem, we further improve four key points of A3C (the state space, action space, reward function, and network structure) by designing four mechanisms: a sliding-window-based two-dimensional state perception mechanism, an adaptive reward function that considers multiple objectives and automatically adjusts to dynamic events, a continuous action space based on CDRs and release strategies, and actor-critic networks based on CNN.

  3. The proposed dynamic scheduling method automatically generates the most appropriate release control and scheduling decisions according to the current production status and is successfully applied to a simplified SMS.

The remainder of this paper is organized as follows. The section “Literature review” provides a literature survey of machine learning-based dynamic scheduling approaches, especially DRL-based approaches. The section “Problem definition and modeling” describes the dynamic scheduling problem formulation based on RL. The section “CNN-A3C dynamic release control and production scheduling framework” proposes the CNN-A3C-based dynamic scheduling framework. The section “CNN-A3C-based dynamic release and scheduling implementation” introduces the state space, action space, reward function, and actor–critic network design. The section “Experimental results and analysis” presents and discusses the experimental results. Finally, the section “Conclusion” concludes the paper and briefly explores directions for future work.

Literature review

The literature review outlined in the following text focuses on dynamic scheduling using machine learning and existing approaches that consider RL for applications in dynamic production scheduling. The drawbacks of existing research are further analyzed in several aspects and compared with the proposed methods.

Dynamic production scheduling based on machine learning

In machine learning-based dynamic scheduling approaches, the scheduling problem can be described as a 3-tuple \(\{F,D,P\}\), where F is the set of complete production attributes of a manufacturing system and represents the production state, D is a set of scheduling decisions, and P is the performance after one scheduling period using decision D under production state F. The aim is to establish a mapping from the production state F to the optimal scheduling decision \(D^*\) that meets the optimal performance evaluation criterion \(P^*\). This mapping is also called the scheduling knowledge or scheduling model. Based on it, a near-optimal scheduling decision that yields better system performance can be found for a given production state. Figure 1 shows the general scheduling knowledge training process. As shown in Fig. 1, the general data-driven scheduling approach takes three steps [21]. In the first step, state feature selection, the key production state features (SF) related to scheduling are selected to improve the scheduling efficiency. Since the historical data are not all optimal, the second step selects the optimal samples as the labeled samples. The third step then mines the scheduling knowledge from the optimal samples using a machine learning algorithm.
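To make the three-step pipeline concrete, the sketch below trains a toy scheduling model with a standard supervised learner; the selected features, the candidate rules, and all numeric values are illustrative assumptions rather than the data or learners used in the cited works.

```python
# A minimal sketch of the three-step supervised pipeline described above.
# The feature names and the candidate rules are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Step 1: state feature selection -- each row is a production state F
# described by a few selected features SF (WIP level, queue length, ...).
X = np.array([[35, 12, 0.8],
              [10,  3, 0.2],
              [50, 20, 0.9]])

# Step 2: optimal sample selection -- the label is the rule that performed
# best (P*) for that state in simulation, e.g. 0=FIFO, 1=EDD, 2=SRPT.
y = np.array([1, 0, 2])

# Step 3: scheduling-knowledge mining -- learn the mapping F -> D*.
model = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Online use: pick a rule for the current production state.
current_state = np.array([[42, 15, 0.85]])
print(model.predict(current_state))   # index of the suggested rule
```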

Fig. 1 The general process of the data-driven scheduling approaches enabled by supervised learning

Although this kind of method can improve the real-time performance and adaptability of the production scheduling system, the following deficiencies remain. It is difficult to obtain sufficient labeled samples from real historical production data [21]. In addition, the training efficiency and the effectiveness of the scheduling knowledge depend heavily on the accuracy of feature selection. When debugging the scheduling knowledge mining in step 3, it is often unavoidable to go back and redo or adjust the first two steps.

Dynamic production scheduling based on DRL

RL has been actively applied to dynamic scheduling problems to overcome the above defects. RL is a branch of machine learning concerned with how agents should take actions in an environment to maximize the cumulative reward. First, unlike supervised learning, RL does not need an optimal sample selection procedure. Second, RL is characterized by sequential decision-making: at each scheduling point, the state at the next scheduling point and the final performance are considered, whereas a supervised-learning-based scheduling method can only optimize the performance over one scheduling period.

RL can be classified into policy-based RL (e.g., policy gradient) and value-based RL (e.g., Q-learning). Policy-based RL trains a probability distribution over actions by sampling strategies and increases the probability of selecting actions with high returns. Value-based RL builds a value table and selects the action with the highest value, where the value of an action is its expected return. However, when the problem is very complex and has an effectively infinite number of states and actions, the values cannot be stored in a table. Scholars therefore use neural networks to approximate value tables and probability distributions; this is the essential idea of DRL [13]. The introduction of deep learning (DL), represented by deep neural networks (DNNs) with perception capability, makes state feature selection no longer a necessary step and enables large-scale scheduling problems to be solved.
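As a minimal illustration of the two RL families mentioned above (not the method of this paper), the sketch below contrasts a tabular Q-learning value update with action sampling from a parameterized policy distribution; the toy state/action sizes and learning rates are assumptions.

```python
# Minimal illustration of the two RL families: a tabular value update
# (Q-learning) and sampling from a parameterized policy distribution.
import numpy as np

n_states, n_actions, alpha, gamma = 5, 3, 0.1, 0.9
Q = np.zeros((n_states, n_actions))            # value table

def q_update(s, a, r, s_next):
    # value-based: move Q(s,a) toward r + gamma * max_a' Q(s',a')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

theta = np.zeros((n_states, n_actions))        # policy parameters

def policy(s):
    # policy-based: sample an action from a softmax distribution;
    # training raises the probability of high-return actions.
    p = np.exp(theta[s]) / np.exp(theta[s]).sum()
    return np.random.choice(n_actions, p=p)

q_update(0, 1, 1.0, 2)
print(Q[0], policy(0))
```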

One of the earliest application studies of RL on production scheduling was from Zhang [33], who proposed a policy gradient-based method to learn domain-specific heuristics for job shop scheduling. In recent years, DRL has received much attention and has recently been employed to solve the dynamic scheduling problem [23]. Table 1 reviews the existing methods for DRL-based dynamic scheduling in chronological order, and summarizes the differences between the aforementioned works and our work.

Table 1 Existing methods on DRL-based dynamic scheduling

State-space design

As shown in Table 1, there are two main state forms, determined by the state perception method. The first is the detailed state of each job and machine; for example, Waschnec et al. used the position of each job and the state of each machine as the state [28]. The second is a set of production attributes, such as the work in process (WIP) count and the buffer queue length; in the work of Shiue et al., 30 system attributes were used to describe the state [23]. The first form is a more detailed and comprehensive model of the system state, but it has the obvious drawback of being unable to handle uncertain problems with varying numbers of jobs. The second form overcomes this drawback, but detailed information is lost [16]. Therefore, it is necessary to study a state perception method that can both retain sufficiently detailed information and handle changes in the job count.

Action space design

As shown in Table 1, there are two main action forms. The first, the job to be operated, indicates which job the machine chooses to process. The SDR (single dispatching rule), MDR (multiple dispatching rules), and CDR (composite dispatching rule) actions are all dispatching rule-based. SDR means that all machines in the system adopt the same dispatching rule, whereas MDR means that different machines adopt different dispatching rules [23]. The CDR combines multiple heuristic dispatching rules by linear weighting and can therefore consider multiple objectives. Dispatching rules are widely applied to real-world optimization problems when the computation time is limited and the problem size is large. Some common heuristic dispatching rules are [17]: first in first out (FIFO), shortest remaining processing time (SRPT), earliest due date (EDD), and critical ratio (CR). The sequence of operations to be processed is determined by the priorities calculated by the dispatching rule. It is well known that a single dispatching rule focuses on only one performance criterion, and no rule outperforms all others across all objectives.
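The following sketch illustrates how a CDR combines single-rule priorities by linear weighting, as described above; the particular job attributes, the sign conventions for each rule, and the weights are illustrative assumptions.

```python
# A sketch of how a CDR combines single-rule priorities by linear weighting.
# The job attributes and the weights are illustrative assumptions.
import numpy as np

def cdr_priority(jobs, weights):
    """jobs: each row = [remaining_proc_time, time_to_due_date, waiting_time]."""
    # Single-rule priorities (larger = more urgent): SRPT, EDD, FIFO.
    q_srpt = -jobs[:, 0]          # shortest remaining processing time first
    q_edd  = -jobs[:, 1]          # earliest due date first
    q_fifo =  jobs[:, 2]          # longest waiting time first
    q = np.stack([q_srpt, q_edd, q_fifo], axis=1)
    return q @ np.asarray(weights)   # weighted combination

jobs = np.array([[120., 300., 40.],
                 [ 60., 100., 10.]])
priorities = cdr_priority(jobs, weights=[0.2, 0.5, 0.3])
print(priorities.argmax())           # index of the job to process next
```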

Reward function design

As shown in Table 1, most works adopted a fixed reward function, i.e., an unchangeable reward calculation procedure. Several works have designed adaptive reward functions to improve adaptability. For example, Waschnec et al. designed a two-phase reward function and proposed a multiagent DQN-based global scheduling approach. They divided the training process into two phases: in the first phase, only one DQN agent was trained while the other machines were controlled by heuristics; in the second phase, all machines were controlled by DQN agents that learned separately. A two-phase reward function was designed to satisfy the different demands of these two training phases [28]. In general, the reward function needs to be adjustable according to the production state to adapt to demand changes.

Network structure

As shown in Table 1, the artificial neural network (ANN), generally a fully connected neural network, is adopted in most related works, and the state space is generally one-dimensional. Several works have begun to design more complex state spaces to improve situational awareness, and the network architecture is improved accordingly. For example, Hu et al. [8] model the flexible manufacturing system using Petri nets and then employ a graph convolutional network (GCN). A GCN is a kind of CNN; compared with a traditional CNN, it can handle unstructured input. The CNN architecture is very important for its performance [30]. Typical structures include LeNet-5 [12], AlexNet [9], GoogLeNet [26], VGG-Nets [24], and ResNets [7]. In addition to these hand-crafted CNN structures, many scholars focus on neural architecture search. For example, Xue et al. [30] propose a self-adaptive mutation neural architecture search algorithm, Xue et al. [29] study a multi-objective evolutionary approach to neural architecture search to design accurate CNNs with small structures, and O’Neill et al. [18] design a genetic algorithm to discover skip-connection structures on DenseNet networks. Recently, CNNs have achieved great success in speech recognition, image recognition, and natural language processing [11]. However, application research in the scheduling field is still relatively scarce.


Research on the application of DRL to production scheduling problems is still in the exploratory stage [19]. Task-specific features and object-related parameters limit the development of a universal method framework. However, some common points can be found. As seen from the first five columns of Table 1, most recent works employed deep Q-networks (DQN) owing to their rule-based discrete action spaces. As mentioned above, no single rule outperforms all others across all objectives. This paper uses a composite dispatching rule (CDR) with weights, so the action space becomes continuous, and a scheduling problem with a continuous action space is incompatible with DQN. The asynchronous advantage actor-critic (A3C) algorithm proposed by Mnih et al. is a model-free, actor-critic, policy gradient-based algorithm that can operate over continuous action spaces [17]. Therefore, this paper employs A3C for dynamic production scheduling. Unlike existing DQN-based scheduling methods, whose decision is a single dispatching rule, this paper improves A3C to generate composite dispatching rules to solve the dynamic scheduling problem of SMSs.

Motivated by the above remarks, we present a CNN-A3C-based dynamic release control and production scheduling framework. Based on this framework, a sliding-window-based state perception mechanism is first designed so that the state contains enough detailed information and can handle varying numbers of jobs; the resulting state has a two-dimensional spatial structure that can be handled by deep convolutional neural networks (CNNs). Second, the action space is designed as a combination of composite dispatching rule (CDR)-based continuous scheduling actions and release strategy-based discrete releasing actions. To deal with this combined action space, we improve A3C to train the networks and design an action decoding module that converts the output actions into decisions that can be executed by the SMS. Third, we also propose an adaptive reward function that considers multiple objectives and automatically adjusts to dynamic events. Finally, the proposed CNN-A3C method is verified on a semiconductor manufacturing line benchmark in terms of various performance criteria, including the on-time delivery rate (ODR) and the mean cycle time (MCT). Comparative experiments under uncertain processing times and urgent orders show that the proposed method outperforms the dispatching rule-based method and other DRL methods.

Problem definition and modeling

Dynamic release control and production scheduling problem definition

Let \(I_0\) denote the scheduling problem in a certain (deterministic) environment. There are o production orders in the semiconductor system, denoted as \(O_1, O_2,\cdots ,O_o \). Each order has one type of product, and the type and the number of products may vary between orders. \(N_o\) and \(D_o\) represent the product count and the due date of order \(O_o\), respectively. There are m machines and M workstations in the SMS. In this paper, production activities are performed according to the release strategy of the system and the CDR of each machine, and machines in the same workstation share the same CDR. Thus, the aim of \(I_0\) is to find a set of optimal release strategy and CDR parameters, described as \(\varvec{a^*}=\{{a_1^0,a_2^0},a_1^1,a_2^1,\cdots ,a_k^1,\cdots ,a_1^M,a_2^M,\cdots ,a_k^M\}\). The first two bits \(\{{a_1^0,a_2^0}\}\) are the parameters of the release strategy, and the other bits represent the production scheduling decision. The detailed explanation is given in the section “Action space design”.

However, uncertainties always occur during the production process. Urgent orders and uncertain processing times of jobs are considered in this paper. Let I denote the dynamic scheduling problem. u represents the urgent order, and \(G_u\), \(N_u\), and \(D_u\) represent its generation time, product count, and due date, respectively. The goal of dynamic scheduling is to adjust the parameters of the release strategy and CDRs in real time according to the current state, protecting production performance from the occurrence of disturbances. The dynamic scheduling problem I adds the concept of time: the production state at time t is denoted as \(s_t\), and the goal of I is to decide the most suitable \(\varvec{a_t}\) according to the current state \(s_t\) in real time.

The following assumptions are made for this SMS, and the notations used are defined in Appendix Table 8.

  • Each order has one type of product, and the type of product may vary between orders.

  • Different types of products have the same processing flow but different processing times.

  • Not all jobs are available at the initial time. The arrival of jobs is determined by the release control strategy.

  • All machines are available at the initial time and never break down.

  • The operations of a job are independent of each other. The operations of one job are independent of those of other jobs.

  • No semifinished product is scrapped.

  • No job needs to be reworked.

  • Urgent orders may be inserted during the production process.

  • The processing times of operations are uncertain, and their expected values are known.

Dynamic release control and production scheduling problem modeling based on RL

In the RL-based scheduling approach, the problem is described as a Markov decision process with a 5-tuple representation \({<}S,A,P,\gamma ,R{>}\) [16]. S is the state space, A is the action space, P is the state transition probability, \(\gamma \) is the discount factor, and R is the reward. In the Markov decision process, the next state \(s_{t+1}\) of the system is only related to the current state \(s_{t}\) and state transition probability P, and has nothing to do with the past state \(s_{t-1}\). The RL agent interacts with the SMS following a policy \({\pi }\), which is a mapping from S to A, \((S \rightarrow A)\), as shown in Fig. 2. In this paper, the policy is a neural network.

Fig. 2 Dynamic scheduling process based on RL

As shown in Fig. 2, t represents the scheduling point, and there are T scheduling points in the production process. In general, T is large and the scheduling period is small, so that scheduling decisions are made in (near) real time. At each decision point t, the RL agent chooses an action \(\varvec{a_t}\) according to the current state \(s_t\), after which the state changes to a new state \(s_{t+1}\) and the agent receives an immediate reward \(r_t \in R\). The objective of the dynamic scheduling RL agent is to find a policy \({\pi }^*\) that maximizes the expected sum of long-term rewards, as shown in Eq. (1), wherein \(\gamma \) is the discount factor

$$\begin{aligned} \pi ^{*}:=\max _{\pi } \mathbb {E}\left( \sum _{t=1}^{T} r_{t} \cdot \gamma ^{t}\right) . \end{aligned}$$
(1)

In addition, at each scheduling point, the decision \(a_t\) is required to respond to the current changes in the environment and to take into account the long-term overall performance criterion. Therefore, the dynamic scheduling problem can be described as Eq. (2)

$$\begin{aligned} I=\left\{ \begin{array}{l} \text {opt}\ p\left( \pi \left( \varvec{a_{t}} \mid s_{t}\right) \right) \\ \text {s.t.}\ \varvec{a_{t}} \in A,\ s_{t} \in S,\ t \in T, \end{array}\right. \end{aligned}$$
(2)

where there are T scheduling points in the production process, \(\varvec{a_t}\) represents the scheduling decision (the composite dispatching rule) at the tth scheduling point, \(\varvec{a_t} \in A\), and \(\pi \left( \varvec{a_{t}} \mid s_{t}\right) \) is a neural network that determines \(\varvec{a_t}\) according to the current state \(s_t\) at each scheduling point, \(s_t \in S\). Thus, the purpose of dynamic scheduling is to obtain the optimal policy \(\pi \left( \varvec{a_{t}} \mid s_{t}\right) \) that meets the optimal performance p, where p represents the accumulated performance criteria over the whole production process T.

As can be seen from Eqs. (1) and (2), the immediate reward \(r_t\) needs to reflect production performance, so the design of the reward function is particularly important. In addition, the design of the state space and action space are also two key tasks in this RL-based dynamic scheduling problem.

CNN-A3C dynamic release control and production scheduling framework

This section proposes a CNN-A3C dynamic scheduling framework that consists of the actual manufacturing system in the physical space and, in the cyber space, the manufacturing system simulation module, the action decoding module, the training module, and the deployment module, as shown in Fig. 3.

Fig. 3 CNN-A3C dynamic scheduling framework

Table 2 Description of modules of the CNN-A3C dynamic scheduling framework
  1. The manufacturing system simulation module has N simulation models that provide the environments interacting with the subnetworks. In this paper, these simulation models are built with the discrete event simulation method. In addition to reproducing the production logic, they must also transmit production state and performance index data to the subnetworks and execute the scheduling strategies output by the subnetworks in real time.

  2. The training module is the crucial part of this method and involves a global network and N subnetworks. Each subnetwork has the same structure as the global network and interacts with its environment independently to update the parameters of the global network asynchronously (a minimal sketch of this asynchronous update pattern is given after this list). The interaction of each subnetwork is implemented as a thread; these N threads run independently and do not interfere with each other.

  3. The action decoding module translates the action output by the network into a schedule that can be executed directly by the production system.

  4. The deployment module is the online application of the scheduling knowledge. The functions, inputs, and outputs of these modules are summarized in Table 2.
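The sketch below illustrates the asynchronous update pattern of the training module described in item (2): several worker threads, each interacting with its own environment copy, push gradient updates to one set of shared global parameters. The toy rollout, the linear model, and all numeric values are stand-ins, not the paper's simulation models or A3C networks.

```python
# A schematic of the training module's asynchrony: N worker threads, each
# with its own environment copy, apply gradient updates to shared global
# parameters. The "environment" and linear model are illustrative stand-ins.
import threading
import numpy as np

global_w = np.zeros(3)                 # shared global parameters
lock = threading.Lock()
rng = np.random.default_rng(0)

def fake_rollout(w):
    """Stand-in for one interaction episode with a simulation copy:
    returns a gradient estimate for the global parameters."""
    state = rng.normal(size=3)
    target = 1.0                       # pretend return signal
    return (w @ state - target) * state   # gradient of a squared error

def worker(n_updates=100, lr=0.01):
    for _ in range(n_updates):
        with lock:                     # copy the latest global parameters
            w_local = global_w.copy()
        grad = fake_rollout(w_local)   # interact with own environment copy
        with lock:                     # asynchronous update of the global net
            global_w[:] -= lr * grad

threads = [threading.Thread(target=worker) for _ in range(4)]   # N = 4 workers
for t in threads: t.start()
for t in threads: t.join()
print(global_w)
```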

CNN-A3C-based dynamic release and scheduling implementation

Based on the above dynamic scheduling framework, this paper further focuses on the training module and proposes a CNN-A3C-based dynamic scheduling method. The three crucial elements in the application of A3C, namely the state space, the action space, and the reward function, are designed in this section.

State-space design

This paper designs a two-dimensional state representation that consists of three parts: machine-related, order-related, and job-related, as shown in Fig. 4. The number of columns in the state is constant and equal to seven; each row has seven elements, which are production attributes, as shown in Table 9. The machine-related part contains six production attributes, so its last column is filled with zeros. As shown in Fig. 4, the machine-related part has m rows, where m is the machine count. The order-related part has \(o+1\) rows, where o is the number of existing orders and the extra row refers to a new order that may arrive. The job-related part comprises detailed job-related information, and its row count should equal the job count. The total number of jobs is variable due to the occurrence of urgent orders; thus, the number of rows in the state would vary with the number of jobs.

Fig. 4 Production state perception mechanism based on the sliding window

To ensure a constant state dimension, we present a production state perception mechanism based on the sliding window that mainly focuses on the state of jobs in the processing stage. The row count of the job-related part is equal to the maximum capacity of the production line. The production state perception mechanism based on the sliding window is shown in Fig. 4.

In a smart manufacturing system, the state of each job can be tracked and recorded through sensor and communication technologies. In this paper, radio-frequency identification (RFID) technology is used to track and manage jobs: each job carries an RFID tag that is scanned before and after every operation, so the state data of the jobs can be obtained. The finished jobs are then sorted by finishing time, and the jobs in process are sorted by entering time. As shown in Fig. 4, jobs further to the right have an earlier finishing time or a later entering time. At each sampling point, the perception window, with a width of C, slides to the far left of the job flow, where C is the maximum capacity of the manufacturing system, i.e., the sum of the capacities of all buffers and machines.
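A minimal sketch of this sliding-window state construction is given below: the state always has seven columns, the machine- and order-related parts are stacked on top, and the job-related part is clipped or zero-padded to the line capacity C. The attribute values and row counts used here are illustrative assumptions.

```python
# A sketch of the two-dimensional state described above: seven columns per
# row and three stacked parts (machine-, order-, job-related), with the job
# part clipped/padded to the line capacity C by the sliding window.
import numpy as np

N_COLS = 7

def build_state(machine_rows, order_rows, job_rows, capacity_C):
    """Each *_rows argument is a list of 7-element attribute vectors."""
    machine = np.asarray(machine_rows, dtype=float).reshape(-1, N_COLS)
    order = np.asarray(order_rows, dtype=float).reshape(-1, N_COLS)

    # Sliding window over the job flow: keep at most C of the most recent
    # in-process jobs, pad with zero rows when fewer jobs are present.
    jobs = np.zeros((capacity_C, N_COLS))
    recent = np.asarray(job_rows, dtype=float)[-capacity_C:]
    if len(recent):
        jobs[:len(recent)] = recent

    return np.vstack([machine, order, jobs])    # shape: (m + o+1 + C, 7)

state = build_state(machine_rows=[[1, 0, 3, 2, 5, 0.7, 0]] * 5,
                    order_rows=[[100, 20, 480, 0, 0, 0, 0]] * 4,
                    job_rows=[[2, 1, 3, 60, 120, 30, 1]] * 10,
                    capacity_C=48)
print(state.shape)    # (5 + 4 + 48, 7)
```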

Action space design

The action consists of two parts, \(\varvec{a_t}=\{{a_1^0,a_2^0},a_1^1,a_2^1,\cdots , a_k^1,\cdots ,a_1^M,a_2^M,\cdots ,a_k^M\}\). As shown in Fig. 5, the first two bits \(\{{a_1^0,a_2^0}\}\) are the parameters of the release strategy, and the other bits represent the production scheduling decision. In this paper, the CONWIP rule is used as the release strategy [2], where \({a_1^0}\) determines the maximum WIP value and thus the release speed, and \({a_2^0}\) represents the release proportion of orders. The second part \(\{a_1^1,a_2^1,\cdots ,a_k^1,\cdots ,a_1^M,a_2^M,\cdots ,a_k^M\}\) is divided into M sections. Taking \(\{a_1^1,a_2^1,\cdots ,a_k^1\}\) as an example, it represents the scheduling decision of workstation 1; each element is the weight of a single dispatching rule, and k is the number of dispatching rules selected for the CDR.

Fig. 5 The procedure of action decoding and real-time scheduling execution

The procedure of action decoding and real-time scheduling execution is as follows, where \(Q_i\) is the priority of job i and \(q_{1,i},q_{2,i},\cdots ,q_{k,i}\) represent the priorities calculated by the individual single dispatching rules.

Algorithm 1: Action decoding and real-time scheduling
Input: The action a output by the RL network
Output: Scheduling result

Compute the maximum WIP value using \(W={a_1^0}\cdot C\)

// Release control
while this scheduling period is not over do
    Compute the current wip
    if \( wip \le W-(o+1)\) then
        if \({a_2^0} \le 0.5\) then
            Release one job for each order
        else
            Release \(o+1\) jobs for the urgent order
        end if
    end if
end while

// Production scheduling (take a machine of workstation M as an example)
while this scheduling period is not over do
    if the machine is vacant then
        Compute the priorities of the waiting jobs using \(Q_i = a_1^M q_{1,i} + a_2^M q_{2,i} + \cdots + a_k^M q_{k,i}\)
        Operate the job with the highest priority
    end if
end while
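The Python sketch below mirrors Algorithm 1 under the action layout \(\varvec{a_t}=\{{a_1^0,a_2^0},a_1^1,\cdots ,a_k^M\}\) described above; the capacity value, the example dimensions, and the randomly generated rule priorities are illustrative assumptions.

```python
# A sketch of Algorithm 1: split the flat action vector into release
# parameters and per-workstation CDR weights, then apply them.
import numpy as np

def decode_action(a, C, n_workstations, k):
    """Split the flat action vector into release parameters and CDR weights."""
    W = a[0] * C                              # maximum WIP (release speed)
    release_proportion = a[1]                 # release proportion of orders
    cdr = np.asarray(a[2:]).reshape(n_workstations, k)   # per-workstation weights
    return W, release_proportion, cdr

def release_decision(W, release_proportion, wip, n_orders):
    """CONWIP-style release: only release when the WIP is low enough."""
    if wip <= W - (n_orders + 1):
        if release_proportion <= 0.5:
            return "release one job per order"
        return "release o+1 jobs for the urgent order"
    return "do not release"

def pick_job(cdr_weights, rule_priorities):
    """rule_priorities: (n_jobs, k) single-rule priorities q_{1..k,i}."""
    Q = rule_priorities @ cdr_weights         # Q_i = sum_j a_j^M * q_{j,i}
    return int(np.argmax(Q))                  # job with the highest priority

a = np.concatenate([[0.6, 0.3], np.random.rand(3 * 4)])   # M=3, k=4 (assumed)
W, prop, cdr = decode_action(a, C=48, n_workstations=3, k=4)
print(release_decision(W, prop, wip=20, n_orders=3),
      pick_job(cdr[0], np.random.rand(5, 4)))
```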

Reward function design

According to the analysis in the section “Reward function design” above, the reward function needs to be adaptable to changes in demand at different phases of production, i.e., it should be adjustable according to the production state. In this paper, urgent orders and uncertain processing times are considered in the release control and production scheduling problem. While an urgent order is present, the optimization goal is skewed towards the ODR of the urgent order. Thus, we divide the production process into two types of phases: the urgent-order-arrival phase and the no-urgent-order phase. As shown in Algorithm 2, in the proposed adaptive reward mechanism, rewards are calculated differently in the two phases.

\(r_t\) is calculated by comparing the simulation output performance measures with their standard values, marked with \(*\). In this case study, the system’s objectives are to maximize the ODR of urgent orders, minimize the MCT, and ensure that the productivity (PROD), average movement (AvgMOV), and flow time do not degrade. Thus, the simulation output performance measures include uTP, TP, and WT, where uTP is the number of jobs of the urgent order finished in this scheduling period, TP is the total number of jobs finished in this scheduling period, and WT is the maximum waiting time of the WIP.

If an urgent order has been placed and is not yet finished, production is in the urgent-order-arrival phase, and the reward \(r_t\) is calculated from uTP and WT as shown in Algorithm 2. If no urgent order has been placed, or the urgent order has been completed, production is in the no-urgent-order phase, and \(r_t\) is calculated from TP and WT as shown in Algorithm 2.

Algorithm 2: Adaptive reward function with ODR, PROD, and CT criteria
Input: Performance measures of the one-step scheduling period
Output: The immediate reward \(r_t\)

\(r_t\) = 0
if the urgent order has been placed and not finished then
    if \(uTP>{uTP}_H^*\) then
        \(r_t +=1\)
    else if \(uTP<{{uTP}_L^*}\) then
        \(r_t -=1\)
    else
        \(r_t +=0\)
    end if
else
    if \(TP>{TP}_H^*\) then
        \(r_t +=1\)
    else if \(TP<{TP}_L^*\) then
        \(r_t -=1\)
    else
        \(r_t +=0\)
    end if
end if
Compute the maximum value of the waiting times of WIP, WT
if \(WT> {WT}_H^*\) then
    \(r_t -=1\)
end if
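A direct Python transcription of Algorithm 2 is given below; the threshold values standing in for \(uTP_H^*\), \(uTP_L^*\), \(TP_H^*\), \(TP_L^*\), and \(WT_H^*\) are illustrative assumptions, since their concrete settings are scenario-dependent.

```python
# A transcription of Algorithm 2 into Python; the default thresholds are
# illustrative assumptions, not the values used in the experiments.
def adaptive_reward(uTP, TP, WT, urgent_active,
                    uTP_H=4, uTP_L=2, TP_H=10, TP_L=6, WT_H=240.0):
    r = 0
    if urgent_active:                 # urgent-order-arrival phase
        if uTP > uTP_H:
            r += 1
        elif uTP < uTP_L:
            r -= 1
    else:                             # no-urgent-order phase
        if TP > TP_H:
            r += 1
        elif TP < TP_L:
            r -= 1
    if WT > WT_H:                     # penalize excessive WIP waiting time
        r -= 1
    return r

print(adaptive_reward(uTP=5, TP=12, WT=300.0, urgent_active=True))   # 0
```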

Actor–critic network architecture

The neural network of the A3C agent is called the actor–critic network; it essentially contains an actor network and a critic network. The input of the actor network is the state, and its output is the action. The input of the critic network is the state, and its output is the evaluation value of that state. To reduce the training difficulty, the actor and critic networks generally share part of their structure and parameters. In this section, an actor–critic network based on a convolutional neural network is designed according to the LeNet-5 [12] structure. As shown in Fig. 6, the actor–critic network contains four convolution layers (C1–C4), two fully connected layers (FC1 and FC2), and an output layer that is divided into an action output and a value output.

Fig. 6 Actor–Critic network architecture based on CNN
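The PyTorch sketch below follows the layout in Fig. 6 (four convolution layers, two shared fully connected layers, and separate action and value heads); the channel counts, kernel sizes, and the sigmoid squashing of the continuous action are assumptions, as the text specifies only the overall structure.

```python
# A sketch of the shared actor-critic CNN: conv trunk (C1-C4), shared FC
# layers (FC1, FC2), and two output heads (action and state value).
import torch
import torch.nn as nn

class ActorCriticCNN(nn.Module):
    def __init__(self, state_rows, state_cols=7, action_dim=2 + 3 * 4):
        super().__init__()
        self.features = nn.Sequential(                  # C1-C4, shared trunk
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        flat = 8 * state_rows * state_cols
        self.fc = nn.Sequential(                        # FC1, FC2 (shared)
            nn.Linear(flat, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.actor = nn.Linear(128, action_dim)         # continuous action head
        self.critic = nn.Linear(128, 1)                 # state-value head

    def forward(self, state):                           # state: (B, 1, rows, 7)
        h = self.fc(self.features(state))
        action = torch.sigmoid(self.actor(h))           # parameters in [0, 1]
        value = self.critic(h)
        return action, value

net = ActorCriticCNN(state_rows=5 + 4 + 48)
a, v = net(torch.zeros(1, 1, 57, 7))
print(a.shape, v.shape)                                 # (1, 14) (1, 1)
```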

CNN-A3C-based scheduling network training

Experimental results and analysis

This section evaluates the performance of the proposed method, i.e., the CNN-A3C dynamic scheduling method, denoted as CNN-A3C here, in terms of the ODR, MCT, PROD, AvgMOV, and Flow time. To validate the effects of the proposed state, action, and reward design methods, the first experiment focuses on the training phase to analyze the performances of our proposed CNN-A3C method and the other three A3C-based methods denoted as A3C, R-A3C, and SN-A3C. The second experiment focuses on the deployment phase to compare the environmental adaptability of our proposed CNN-A3C method with the A3C and rule-based methods.

Experimental scenarios and parameter setting

The proposed method is evaluated on the semiconductor smart manufacturing demonstration unit system that our research group established according to the benchmark Minifab proposed in [27]. As shown in Fig. 7, the top part is the physical photograph, and the bottom part is the top view. This system contains five machines, three workstations, two buffers, and one robot. The diffusion workstation consists of two parallel and batching diffusion machines, Ma and Mb. The ion implantation workstation has two parallel ion implantation machines, Mc and Md. The lithography workstation has one lithography machine Me. There are three types of products: product_a, product_b, and product_c, running in this system. Each product type includes six processing steps. In addition, this demonstration system is equipped with a reliable sensor network based on RFID, OLE for Process Control (OPC), and industrial Internet technologies to enable real-time data transactions. Its reliable execution unit ensures the precise execution of the schedule.

Based on this Minifab system, we extended six experimental scenarios to verify the proposed method, as shown in Table 3. Scenario 2 is prepared to train the networks and verify the proposed state, action, and reward design effects in the CNN-A3C dynamic scheduling approach. The other scenarios are designed to test the adaptability of our approach in terms of the variabilities in the urgent order and processing times.

Fig. 7 Semiconductor smart manufacturing demonstration unit system

Table 3 Description of experimental scenarios (unit: min)

Effectiveness of the CNN-A3C on the training phase

This section analyzes the effectiveness of the proposed state design, action design, and adaptive reward function on the CNN-A3C method by comparing the training performances with the other three A3C-based methods. The contrastive methods are described in Table 4. A3C represents the conventional A3C-based dynamic scheduling method that uses the existing production attribute-based state space, fixed reward function, and ANN network architecture. SN-A3C represents the A3C-based method using the proposed adaptive reward function. R-A3C represents the A3C-based method using the proposed slide-window-based state space and CNN networks.

In this case study, the parameter settings of these four methods are detailed in Tables 11, 12, 13, and 14. A detailed description of A3C can be found in [17].

Table 4 Descriptions of contrastive methods

The networks are trained for 30,000 episodes on a deep learning workstation equipped with an Intel 12-core 3.5-GHz CPU, a GTX1080TI GPU, and 16-GB memory. The performance curves of the training phase are shown in Fig. 8, where the x and y coordinates of each point represent the cumulative number of episodes and the performance criteria, respectively. For quantitative comparison, we calculate the average performance criteria of the last 300 episodes, as shown in Table 5.

ODR_a, ODR_b, and ODR_u are the on-time delivery rates of order a, order b, and the urgent order, respectively. The ODR of an order X is the ratio of the number of its products finished before the due date to the total product count of that order. MCT is the mean cycle time of all jobs, PROD is the average daily productivity, AvgMOV is the average daily movement, and flow time is the total processing time. For MCT and flow time, lower values are better; for the other criteria, higher values are better.
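For concreteness, the small sketch below computes ODR and MCT from per-job records; the record layout (release time, finish time, due date) and the example numbers are assumptions.

```python
# A sketch of how ODR and MCT can be computed from job records.
def odr(jobs):
    """On-time delivery rate: share of jobs finished before their due date."""
    return sum(j["finish"] <= j["due"] for j in jobs) / len(jobs)

def mct(jobs):
    """Mean cycle time: average of (finish - release) over all jobs."""
    return sum(j["finish"] - j["release"] for j in jobs) / len(jobs)

order_a = [{"release": 0, "finish": 900, "due": 1000},
           {"release": 0, "finish": 1100, "due": 1000}]
print(odr(order_a), mct(order_a))   # 0.5 1000.0
```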

Fig. 8 Training results of subnetwork 2 (best agent): a ODR_u, b ODR_a, c ODR_b, and d MCT

Table 5 Training results over the last 300 episodes for each method

First, the A3C and R-A3C methods are compared. R-A3C adds the proposed state space and CNN network structure to the A3C method. As can be seen from Table 5, the convergence values of the two methods are almost equal for each performance index. As shown in Fig. 8, the performance of R-A3C converges after approximately 8,000 episodes, while the A3C method requires about 20,000 episodes to converge. The proposed state perception mechanism is thus conducive to an overall perception of the production state, and the two additions prove to be quite beneficial in the training phase.

Second, the A3C and SN-A3C methods are compared. SN-A3C adds the proposed adaptive reward function to the A3C method. As can be seen from Table 5, the ODR_u of the A3C method is 0.69, while that of the SN-A3C method is 0.99; SN-A3C significantly improves ODR_u. For the other performance criteria, the differences are negligible. As can be seen from Fig. 8, the performance of A3C converges after approximately 20,000 episodes, while SN-A3C requires about 27,000 episodes. Therefore, the more complex reward function slows down the convergence rate, but the proposed adaptive reward function optimizes the ODR of the urgent order while maintaining the other performances.

Finally, we compare the CNN-A3C method with the other three methods to verify the effectiveness of combining all the improvements. The comparative analysis of the above two groups has already discussed the strengths and weaknesses of each individual improvement. As shown in Fig. 8, the performance of the CNN-A3C method converges after approximately 8,000 episodes, a faster convergence rate than the other three methods. As can be seen from Table 5, the MCT of the CNN-A3C method is 1164.06, while the values of the other methods are all over 2200; for the other performance criteria, the differences are negligible. Therefore, the CNN-A3C method can optimize the ODR of the urgent order and the MCT while maintaining the other performances when urgent orders and stochastic processing times occur.

Environmental adaptability on the deployment phase

To further examine the adaptability of the CNN-A3C method, this section applies the network trained in scenario 2 to the other new scenarios. The proposed method is compared with the conventional A3C method and the rule-based method FIFO.

To reduce the effect of randomness on the results, 30 repeated experiments were conducted; Table 6 shows the mean results of the 30 repeated tests. The last column shows the comprehensive performance, which is calculated by normalizing each performance value and then summing the normalized values; larger values indicate better performance. For clarity, the comprehensive performance data in the table are also illustrated as a histogram in Fig. 9. Table 6 and Fig. 9 show that the CNN-A3C method remains effective in the new scenarios and is superior to the other two methods in terms of multi-objective optimization.
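A sketch of how such a comprehensive score can be computed is given below: each criterion is min-max normalized across methods and then summed. The inversion of lower-is-better criteria (MCT, flow time) and the example numbers are assumptions, since the paper does not spell out the normalization details.

```python
# A sketch of the comprehensive score: normalize each criterion across
# methods, flip lower-is-better criteria, and sum. Example data are made up.
import numpy as np

def comprehensive(values, lower_is_better):
    """values: (n_methods, n_criteria); returns one score per method."""
    v = np.asarray(values, dtype=float)
    v[:, lower_is_better] = -v[:, lower_is_better]      # flip so higher = better
    v_min, v_max = v.min(axis=0), v.max(axis=0)
    norm = (v - v_min) / np.where(v_max > v_min, v_max - v_min, 1.0)
    return norm.sum(axis=1)

#            ODR_u  MCT    PROD
methods = [[0.95, 1200.0, 80.0],     # CNN-A3C
           [0.70, 2300.0, 81.0],     # A3C
           [0.65, 2500.0, 79.0]]     # FIFO
print(comprehensive(methods, lower_is_better=[1]))
```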

Fig. 9 Test results in the other new scenarios

Table 6 Results of 30 repeated tests in the new scenarios
Table 7 Improvement over the rule-based method

Furthermore, the improvement rates of the CNN-A3C and A3C methods over the rule-based method are calculated and shown in Table 7. It can be seen that CNN-A3C significantly improves the ODR of the urgent order and the MCT. For instance, in scenario 4, CNN-A3C improves ODR_u and MCT by 14.81% and 52%, respectively, compared to the rule-based method. In addition, CNN-A3C shows only small degradations in the other performance criteria; for example, in scenario 3, it reduces ODR_b, PROD, AvgMOV, and flow time by 2.13%, 0.68%, 0.68%, and 0.68%, respectively, compared to the rule-based method. However, as shown in the last column of Table 7, CNN-A3C outperforms both the A3C and rule-based methods in terms of the mean improvement.

Conclusion

This paper studied the dynamic release control and production scheduling problem of SMSs considering uncertainties from both the internal and external environment and proposed a CNN-A3C-based approach. Based on the results of this work, we draw the following conclusions. The CNN-A3C method is able to optimize the ODR of the urgent order and the MCT and improves the overall performance of the manufacturing system under disturbances from urgent orders and stochastic processing times. Moreover, the trained network remains effective in new scenarios.

In the future, because the proposed approach cannot respond to changes in the machine count, it would be of interest to improve the network structure and introduce multiagent technology. In addition, the complete realization of the demonstration unit indicates that the proposed method has the potential for further application in real industry.