Introduction

The advent of Industry 4.0 (e.g., Moeuf et al. 2018) and the coming of age of the internet of things (IoT), artificial intelligence (AI), and machine learning (ML) (Cohen et al. 2019b) have created a major opportunity for integrating smart control and maintenance of manufacturing processes (Cohen et al. 2019a; Lu et al. 2016; Voisin et al. 2018; Zheng et al. 2018; Zhong et al. 2017a, 2017b). A framework for the sustainable control and maintenance of production equipment would contribute to the efficiency of these advanced manufacturing systems (Jasiulewicz-Kaczmarek and Gola 2019).

Many researchers advocate for such an undertaking (e.g., Akkermans et al. 2016; Bokrantz et al. 2019; He et al. 2019; Kumar and Galar 2018; Lu et al. 2016) and many papers have suggested frameworks that integrate monitoring, control and maintenance (see Table 1). However, recent rapid advances in technology create a need for an updated model that is fully aligned with current Industry 4.0 concepts and technologies.

Table 1 Comparison of the proposed framework to other selected references

This paper presents a smart, holistic process-controller framework designed for current Industry 4.0 shop floors. The framework spans the monitoring, control, and maintenance of a single smart process controller. It is described as a complete package that contains all the modules needed by a single controller, and it is therefore implemented as a unified control solution. In such a setting, the issues of interoperability, data/information representation, and exchange formats relate to the interface between the single controller and the external digital world. Consequently, a dedicated smart gateway handles these issues, as explained in the “Discussion on implementation and validation” section.

The main challenge is to devise a framework compatible with Industry 4.0 and Cyber Physical Systems (CPS) concepts and practices. Examples of such concepts are self-awareness, self-diagnosis, self-prognosis, and self-healing. Examples of such practices are the use of artificial intelligence (AI) techniques such as machine learning (ML) and case-based reasoning (CBR), internet of things (IoT) communications, digital twins, and predictive maintenance.

We assume that a holistic framework compatible with these Industry 4.0 concepts and practices can be devised, and that it includes the following six components: (1) automated process control (APC); (2) statistical process control (SPC); (3) recurrent machine learning (RML); (4) smart process diagnosis; (5) smart process prognosis for predictive maintenance and intervention; and (6) an interaction platform with humans, the manufacturing system, and the external IoT.

Several other frameworks have proposed parts of this approach, and each has its limitations and drawbacks. Saif et al. (2011) suggested a fuzzy integrated Statistical Process Control/Automated Process Control (SPC/APC) scheme for controlling process quality and robustness. However, the scheme did not include any maintenance aspects, and a fuzzy implementation is not suitable for many (if not most) SPC/APC systems. Ben-Gal and Singer (2004) and Singer and Ben-Gal (2007) proposed an SPC methodology based on Markov models and context modeling of finite-state processes with engineering process control (EPC) to monitor the nonlinear and finite-state processes that often result from feedback-controlled processes. Siddiqui et al. (2015) illustrated the integration of SPC and EPC in a heating process, but the proposed framework only enabled low-level control over the process. The frameworks described by Algabroun et al. (2017) and Mishra and Mungi (2018) are extremely general. For example, Algabroun et al. (2017) only named the general steps ("abnormality detection", "diagnosis", "prognosis", etc.), with no further details. Similarly, Mishra and Mungi (2018) proposed a "cooperation of several systems with regards to manufacturing, overhauling, assembly, ecology, economics, society, and environment" but did not explain how to implement their suggestions. Jantunen et al. (2018) did not propose a new framework; rather, they presented the European research initiative to solicit such a framework, the ECSEL-MANTIS project, and described how it aims to revolutionize maintenance by using CPS to attain Maintenance 4.0. In recent decades, advances in self-x capabilities (self-awareness, self-diagnosis, self-prognosis, self-healing, etc.) have brought process control closer to process maintenance (Bokrantz et al. 2017; Dutt et al. 2016). Barco et al. (2012) described a framework for self-healing in wireless networks.
However, this framework cannot be extended beyond the realm of wireless networks. Vassev and Hinchey (2011) reviewed awareness in state-of-the-art autonomic systems and service-component ensembles. Emmanouilidis and Pistofidis (2010) discussed the prospects for achieving machinery self-awareness using wireless sensor networks, as well as the potential for sustainable machinery operation. Based on these examples, there is clearly a gap between the existing literature and the desired framework for smart control and maintenance. This paper fills this gap by presenting a comprehensive framework that integrates machine-learning techniques into smart control and maintenance. While IoT connects everything to everything (Cohen et al. 2019b; Perera et al. 2014), the cheapest and fastest method of dealing with abnormalities is to equip systems, or even subsystems, with the ability to monitor, diagnose, and heal themselves (Seebach et al. 2010; Moeuf et al. 2018). Some recent papers present advanced usage of ML for controlling processes (e.g., Kanawaday et al. 2017; Simba et al. 2018; Shang and You 2019).

Individual state-of-the-art ML papers do not, however, present a full holistic control framework. The purpose of this paper is to present a holistic framework that will suit current and future generations of ML techniques. This paper does not, and should not, compete with any ML-specific technique.

The proposed framework focuses on the most challenging part of the control system: the smart controller logic. See Fig. 1, which depicts a typical block diagram of a manufacturing process control system in which a smart controller is embedded.

Fig. 1

A smart controller in a block diagram of a typical manufacturing process control system (arrows signify signals and data)

The proposed framework assumes that the sensors, and other sub-systems and systems, practice self-awareness and maintain their own reliability. In developing the methodological framework of the controller, we integrated APC and corrective actions, where such actions are undertaken either as part of the smart control or by a human expert.

The rest of the paper is structured as follows. Section 2 presents a literature review. Section 3 introduces the proposed smart-controller framework. Section 4 presents an implementation showcase of the framework for its validation. Section 5 discusses the implementation and validation of the framework. Finally, Sect. 6 concludes the paper.

Literature review

The literature review focuses mainly on models and frameworks that combine process monitoring and maintenance. Not surprisingly, these models have much in common. However, most of them are not concerned with self-awareness, self-diagnosis, or self-prognosis. Therefore, applying the lessons learned from these models to a single controller is still innovative. Some models combined process monitoring and maintenance long before the age of Industry 4.0. For example, the international standard ISO 13374 has six steps: data acquisition, data manipulation, state detection, health assessment, prognostic assessment, and advisory generation. Iung et al. (2009) suggested a conceptual framework for e-maintenance that encompasses a business-process view of the activities related to e-maintenance. They describe four major consecutive activities that lead to maintenance decisions: (1) acquiring and processing signals; (2) monitoring data and diagnosing; (3) prognosticating; and (4) supporting decisions. Atluru et al. (2012) propose a supervisory framework that integrates process planning, health maintenance, and tool condition monitoring. The machine controller receives information and commands from a separate manufacturing process monitoring module, which interacts with the process planning module, human supervisors, other controllers, and external sensors. The model was closely tailored to the CNC case study presented. Siddiqui et al. (2015) proposed a novel framework integrating multivariate statistical process control and engineering process control. Chang et al. (2016) integrate SPC, EPC, and artificial neural network (ANN) pattern recognition for system process monitoring, fault diagnosis, and automatic system control. This approach, while simple, is a significant step towards adding system diagnosis to automated process control.
Wang (2016) proposes a framework for intelligent predictive maintenance in Industry 4.0 settings. He describes a general architecture for a shop floor or manufacturing system in which the activity of the controllers is limited to actuation and data collection. All data-related activities, including data processing, calculation of process compensations, diagnosis, prognosis, decisions, and actions, are carried out in the cloud, away from the controllers. Terrissa et al. (2016) take a similar approach, in which all supervisory control functions are provided as services on the internet.
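As an illustration only, the six ISO 13374 steps listed above can be read as a processing chain in which each step consumes the output of the previous one. The Python sketch below wires hypothetical implementations of each step together; the moving-average feature, the `ALERT_LEVEL`/`ALARM_LEVEL` thresholds, the linear trend, and the advisory strings are our assumptions, since the standard defines functional blocks rather than specific computations.

```python
from statistics import mean

# Hypothetical thresholds for illustration only; ISO 13374 defines the
# functional blocks, not these values.
ALERT_LEVEL = 1.5
ALARM_LEVEL = 3.0


def data_acquisition(raw):
    # Step 1: accept raw sensor readings as floats.
    return [float(x) for x in raw]


def data_manipulation(samples, window=4):
    # Step 2: derive a feature, here the mean of the last `window` samples.
    return mean(samples[-window:])


def state_detection(feature):
    # Step 3: compare the feature with reference levels.
    if feature >= ALARM_LEVEL:
        return "alarm"
    if feature >= ALERT_LEVEL:
        return "alert"
    return "normal"


def health_assessment(state):
    # Step 4: map the detected state to a health grade (1.0 = healthy).
    return {"normal": 1.0, "alert": 0.5, "alarm": 0.0}[state]


def prognostic_assessment(feature_history):
    # Step 5: naive linear trend of the feature per sample.
    if len(feature_history) < 2:
        return 0.0
    return (feature_history[-1] - feature_history[0]) / (len(feature_history) - 1)


def advisory_generation(health, trend):
    # Step 6: turn health and trend into a recommendation.
    if health == 0.0:
        return "stop and inspect"
    if health < 1.0 or trend > 0:
        return "schedule maintenance"
    return "continue"


def run_chain(raw, feature_history):
    """Run one pass of the six-step chain; return (state, advisory)."""
    samples = data_acquisition(raw)
    feature = data_manipulation(samples)
    state = state_detection(feature)
    health = health_assessment(state)
    trend = prognostic_assessment(list(feature_history) + [feature])
    return state, advisory_generation(health, trend)


# Example: a stable signal whose feature has not changed since the last pass.
state, advice = run_chain([1.0, 1.0, 1.0, 1.0], feature_history=[1.0])
```

Keeping each step a separate function mirrors the standard's separation of concerns: any single step (for example, state detection) can be replaced by a more sophisticated method without touching the others.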

Lee et al. (2015) proposed a cyber-physical systems architecture for Industry 4.0-based manufacturing systems. Their model includes five major levels: (1) smart connection (related to IoT); (2) data-to-information conversion (data processing); (3) cyber level (twin models and state characterization); (4) cognition level (updating the human supervisor, and performing analytics and decision making); and (5) configuration level (actions such as self-configuration and self-adjustment). This model is very general and, in that sense, analogous to the classic seven-layer OSI (Open Systems Interconnection) reference model for network communication. Our paper, in turn, proposes how to implement most of this general layered architecture for a single process controller. The similarities and differences between Lee et al.'s model and the proposed framework appear in Tables 1 and 2.

Table 2 Modules versus functions in various process supervisory models

Wu et al. (2017) develop a fog-based computational framework that enables remote real-time sensing for shop-floor control, and case studies provide proof of concept of their model. The framework consists of four integral elements: a workflow, sensor networks, communication protocols, and predictive analytics. The workflow includes: (1) data collection; (2) data streaming; (3) cloud-based diagnostic and prognostic modeling; and (4) application of the diagnostic and prognostic models. The work focuses on processing massive amounts of real-time data with minimal latency. While Wu et al. (2017) propose sampling as a means of reducing the amount of ML processing, our approach is different: we advocate close local monitoring that transfers only data that are deemed, or suspected, to be related to an exception. A comparison of the similarities and differences between Wu et al.'s framework and our proposed model appears in Tables 1 and 2.
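The contrast with sampling can be sketched as an exception-based filter running on the controller itself: every sample is checked locally, and only samples outside a tolerance band are transmitted, together with a short window of preceding context. The class below is a minimal Python sketch under that assumption; the tolerance-band test and the parameter names `target`, `tolerance`, and `context` are hypothetical stand-ins for whatever exception criterion an actual controller would use (for example, an SPC or ML alarm).

```python
from collections import deque


class ExceptionForwarder:
    """Keep data local and forward only samples suspected to be exceptions.

    `target`, `tolerance`, and `context` are hypothetical parameters.
    """

    def __init__(self, target, tolerance, context=3):
        self.target = target
        self.tolerance = tolerance
        # Rolling buffer of the most recent samples, kept on the controller.
        self.buffer = deque(maxlen=context)

    def observe(self, t, value):
        """Return a payload to transmit, or None if the sample stays local."""
        payload = None
        if abs(value - self.target) > self.tolerance:
            # Suspected exception: send it with the samples that preceded it.
            payload = {"context": list(self.buffer), "exception": (t, value)}
        self.buffer.append((t, value))
        return payload


fwd = ExceptionForwarder(target=10.0, tolerance=0.5, context=2)
stream = [10.1, 9.8, 10.2, 11.0, 10.0]
# Only the out-of-band sample at t=3 leaves the controller.
sent = [p for p in (fwd.observe(t, v) for t, v in enumerate(stream)) if p is not None]
```

Shipping the preceding context with each exception lets a remote diagnosis service see how the excursion developed without receiving the full stream.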

Both Matyas et al. (2017) and Ansari et al. (2019) propose a circular model with four steps: (1) data acquisition and pre-processing; (2) data analysis and simulation; (3) reaction model; and (4) prescriptive maintenance decision support. Algabroun et al. (2017) propose a "Maintenance 4.0" framework that divides a manufacturing system into: (1) a managing system; (2) change management; (3) a mediator; and (4) a managed system. While the sensors and actuators (and, obviously, the controller) are in the managed system, all data processing, monitoring, and decision making are carried out in the "change management" part, which contains four modules: (1) monitor; (2) analyzer; (3) planner; and (4) executor. Their framework lacks significant use of AI, and it lacks the capability for continual model adjustment.
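The monitor/analyzer/planner/executor split in the "change management" part closely resembles the MAPE-K feedback loop known from autonomic computing. A minimal Python sketch of one pass through such a loop follows, assuming a hypothetical managed system with a single setpoint and a single reading, and a naive correction that shifts the setpoint against the deviation. It only shows how the four modules hand results to one another; it is not Algabroun et al.'s implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ManagedSystem:
    # Hypothetical managed system: one actuator setpoint and one sensor reading.
    setpoint: float
    reading: float


@dataclass
class ChangeManagement:
    target: float
    tolerance: float
    log: list = field(default_factory=list)

    def monitor(self, system):
        # Monitor: collect the current reading from the managed system.
        return system.reading

    def analyze(self, reading):
        # Analyzer: deviation from target, or None when within tolerance.
        deviation = reading - self.target
        return deviation if abs(deviation) > self.tolerance else None

    def plan(self, deviation):
        # Planner: naive correction opposite to the deviation.
        return -deviation

    def execute(self, system, correction):
        # Executor: apply the correction and record it.
        system.setpoint += correction
        self.log.append(correction)

    def step(self, system):
        deviation = self.analyze(self.monitor(system))
        if deviation is not None:
            self.execute(system, self.plan(deviation))


plant = ManagedSystem(setpoint=5.0, reading=5.8)
cm = ChangeManagement(target=5.0, tolerance=0.5)
cm.step(plant)  # deviation of about 0.8 exceeds the tolerance: setpoint is lowered
```

In this toy the reading does not respond to the setpoint; in a real managed system the next `monitor` call would observe the effect of the correction.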

Peres et al. (2018) propose the "Intelligent Data Analysis and Real-Time Supervision" (IDARTS) framework, which is composed of three main components: (1) the "Cyber Physical Production System" (CPPS), a digital twin of the shop floor with digital representations of its entities; (2) real-time data analysis (predictions and visualization for evaluation and decision making); and (3) knowledge management (analytics). Interestingly, data acquisition and pre-processing are done in the CPPS, and the pre-processed data go through the real-time module and the knowledge management module before returning to the evaluation and decision-making part of the CPPS. A comparison of the similarities and differences between IDARTS and the proposed model appears in Tables 1 and 2. Ge (2018) describes a distributed predictive modeling framework for the prediction and diagnosis of key performance indices in plant-wide processes. Although Ge's paper does not mention Industry 4.0, it supports the approach of distributing shop-floor control among the various process controllers. Mishra and Mungi (2018) suggest a sustainable maintenance framework that will "incorporate cooperation with numerous other frameworks, with respect to manufacturing, overhauling, assembly, ecology, economics, society, and environment while carrying out the maintenance act."

In addition, numerous papers have proposed frameworks for other subjects (not manufacturing process control) that have some overlap with the proposed framework. For example, Barco et al. (2012) proposed a unified framework for self-healing in wireless networks. Golan et al. (2019) proposed a framework for operator–workstation interaction in Industry 4.0; Amini and Chang (2018b) proposed a process monitoring framework for 3D metal printing on an industrial scale. Franciosi et al. (2020) undertook a systematic literature review on measuring the impact of maintenance. They suggest a three-dimensional framework model, where the dimensions are: maintenance process, sustainability category, type of impact. A comparison of selected references is presented in Table 1.

The smart controller framework

This section describes the proposed smart controller framework and its main components. The proposed methodological framework is based on the following CPS characteristics: (1) it includes the physical manufacturing process; (2) it controls that process; (3) it involves intensive computations, including extensive use of AI; (4) it includes real-time communications with the process-related sensors and actuators; and (5) it has a gateway through which it communicates with the manufacturing network, the internet, and the operator. Other CPS characteristics are related to "self-x" technology: self-awareness, self-diagnosis, self-prognosis, and self-healing, where each level in this hierarchy requires input from its predecessor in order to perform its function (Cohen et al. 2019a; Dutt et al. 2016; Seiger et al. 2018). Recently, the use of digital twins has become closely associated with CPS and Industry 4.0; a digital twin of the manufacturing process and its controller is part of the suggested prognosis and healing module. Finally, AI techniques have become the hallmark of CPS computations and Industry 4.0 intelligence. The proposed framework relies heavily on techniques such as machine learning (ML), root-cause analysis (RCA), and case-based reasoning (CBR) (Ruschel et al. 2017). However, the aim of the paper is to provide a generic, holistic framework that suits a wide range of current and future ML methods; the framework does not, and is not intended to, compete with any ML-specific technique.

As interaction and interoperability are crucial elements in CPS and Industry 4.0 manufacturing systems, an external interaction module is dedicated to interacting and solving interoperability issues. While the Industry 4.0 environment is highly automated, it typically supports human operators rather than replaces them (Golan et al. 2019; Mattsson et al. 2020). Accordingly, we assign high importance to attributes such as human knowledge, experience, flexibility, skill, wisdom, and judgement. Therefore, the external interaction module includes a special sub-module for human interaction.

Figure 2 depicts the four major modules of the framework, and the relationships between them. The main modules are designated by rectangles, information flows are designated by continuous arrows, and interventions are marked by dashed arrows. The framework is implemented on a smart controller and has four main modules:

  • Control & awareness module: The word "control" represents the element of automated process control (APC), while the word "awareness" refers to process monitoring using SPC and ML.

  • Process-diagnosis module: This module is invoked only when a process abnormality is detected or when the process is drifting or is out of control. It analyzes the process with the purpose of identifying: (1) changes in process factors, (2) new process factors, (3) process drifting, and (4) problems. In the case of identifying problems, the module should support root-cause analysis (RCA). Popular RCA methods are case-based reasoning (CBR) and big-data search.

  • Prognosis & healing module: This module produces a prognosis and decides on an action, which may be one of the following: (1) intervening automatically, (2) asking for a human decision, or (3) doing nothing. In the case of automated intervention, the module performs the intervention and then sends minor modifications of the ML weights, or updates of the ML model, to the control and awareness module. When significant changes occur in the system's behavior or environment, the prognosis and healing module revises the ML model. Each such revision generates a new ML model that replaces the existing one. This process of creating new ML versions is referred to as recurrent machine learning (RML).

  • External Interaction Platform module: This module is responsible for all interaction between the controller and external entities other than the process sensors and actuators, namely the human operator, other machines and processes on the shop floor, and the internet. To communicate with other digital entities, the module deploys a smart gateway. Communication with people serves mainly the following purposes: (1) conveying alarms and other information to the human operator, (2) implementing intervention orders (e.g., shut-down) from the operator, (3) receiving information supplied by the human operator, and (4) supervised ML training.
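The distinction drawn for the prognosis & healing module between minor adjustments and full revisions in recurrent machine learning (RML) can be sketched as a small model registry. In the hypothetical Python sketch below, `adjust` applies small weight modifications to the active model without creating a new version, whereas `revise` archives the active model and deploys a new version in its place; the class names and the list-of-weights model representation are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    weights: list


@dataclass
class RMLRegistry:
    """Hypothetical registry of recurrent machine learning (RML) versions."""
    active: ModelVersion
    history: list = field(default_factory=list)

    def adjust(self, deltas):
        # Minor modification: nudge the weights; the version number is unchanged.
        if len(deltas) != len(self.active.weights):
            raise ValueError("deltas must match the number of weights")
        self.active.weights = [w + d for w, d in zip(self.active.weights, deltas)]

    def revise(self, new_weights):
        # Significant change: archive the active model and deploy a new version.
        self.history.append(self.active)
        self.active = ModelVersion(self.active.version + 1, list(new_weights))


reg = RMLRegistry(ModelVersion(version=1, weights=[0.5, 1.0]))
reg.adjust([0.1, -0.2])   # minor modification, still version 1
reg.revise([2.0, 0.0])    # significant change: version 2 replaces version 1
```

Keeping archived versions makes it possible to roll back if a revision performs worse than its predecessor.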

Fig. 2

The proposed smart process control framework and its main elements

The modules communicate by sending and receiving information, as depicted by the numbered arrows in Fig. 2. A brief description of each arrow follows:

  1. Arrow 1: The process sensors transmit all relevant data and information directly to the control & awareness module. These data include product measures, process measures, and context-related data.

  2. Arrow 2: Carries information from the control & awareness module to the process-diagnosis module. The information includes all data relevant for diagnosis, including a granular history of the process parameters (product measures, process measures, and context-related data), fault and anomaly detections, and controller-awareness information.

  3. Arrow 3: Carries (and shares) information from the process-diagnosis module with the prognosis & healing module. This information is needed for prognostics, for simulating the effects of various compensation or correction alternatives, for continual correction or compensation, or for automated configuration changes.

  4. Arrow 4: Carries information gathered and processed by the process-diagnosis module to the human interaction module. On the one hand, it carries information related to the supervised ML training stage, which is crucial for obtaining expert supervisor feedback. On the other hand, it includes warnings regarding the appearance of new factors and changes in factor values, as well as any other information that may prompt a human decision or intervention.

  5. Arrow 5: Carries human feedback, including feedback intended for the supervised ML model, from the human interaction module to the process-diagnosis module.

  6. Arrow 6: Carries human feedback from the External Interaction Platform to the self-healing module. This information is important for improving or correcting the automatic interventions of the self-healing module.

  7. Arrow 7: Carries information from the self-healing module to the human interaction module, to be passed on to the human operator. This information includes prognosis warnings and alarms (if any), details of self-healing interventions, and queries related to feedback and guidance for the self-healing module.

  8. Arrow 8: Carries all information and queries from the controller modules to the human operator. The arrow originates at the External Interaction Platform module, which serves as an intermediary for conveying data, information, and queries to and from the operator.

  9. Arrow 9: Carries information from the human operator to the External Interaction Platform, which then conveys it to the relevant modules. For example, human feedback related to the supervised training stages of the RML process is passed on to the process-diagnosis module by Arrow 5, while Arrow 6 conveys human feedback to the self-healing module.

  10. Arrow 10: Carries information from the External Interaction Platform to the Industrial Internet of Things (IIoT). This includes continual posting of the current process status, replies to various queries and requests (providing process-related information and history), and queries sent to various IIoT entities.

  11. Arrow 11: Carries information from the IIoT to the External Interaction Platform. This information may consist of responses to queries, shop-floor supervisory information, or system orders.

  12. Arrow 12: Carries information from the shop floor and its related systems through the IIoT directly to the process-diagnosis module, providing information that may be relevant to most of its diagnosis activities.

  13. Arrow 13: Carries information from the external Internet/Web to the process-diagnosis module, mainly to support big-data search, CBR, and RCBR. This information includes cases from similar controllers in similar processes, which expand the learning and experience of the single controller. It may also include healing recommendations for various situations.

Intervention arrows (dashed) designate either: (a) changes to the process actuators and/or switches, initiated by the human operator or by the self-healing module; or (b) major changes to the ML model (in the control and awareness module) initiated by the process-diagnosis module. Human operator interventions are communicated through an intervention interface on the human interaction module and are passed directly to the control and awareness module for implementation. The self-healing module initiates interventions for automatic process tuning and correction or, in some cases, for changes to the manufacturing system’s configuration. The following subsections explain each of the modules in detail.
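Because the arrows define which module may send information to which, they can be captured as a routing table. The Python sketch below encodes Arrows 1 to 13 and the dashed intervention paths as data, and answers simple questions such as which arrows feed a module, or which chain of arrows carries operator feedback to the process-diagnosis module. The node names are hypothetical, and the human interaction module is folded into the External Interaction Platform node, of which it is a sub-module.

```python
from collections import deque

# Information arrows (solid): number -> (source, destination).
# Node names are hypothetical labels for the modules and external parties.
ARROWS = {
    1: ("sensors", "control_awareness"),
    2: ("control_awareness", "process_diagnosis"),
    3: ("process_diagnosis", "prognosis_healing"),
    4: ("process_diagnosis", "external_interaction"),
    5: ("external_interaction", "process_diagnosis"),
    6: ("external_interaction", "prognosis_healing"),
    7: ("prognosis_healing", "external_interaction"),
    8: ("external_interaction", "human_operator"),
    9: ("human_operator", "external_interaction"),
    10: ("external_interaction", "iiot"),
    11: ("iiot", "external_interaction"),
    12: ("iiot", "process_diagnosis"),
    13: ("internet", "process_diagnosis"),
}

# Dashed intervention paths: (initiator, where the intervention is applied).
INTERVENTIONS = [
    ("human_operator", "control_awareness"),     # via the human interaction module
    ("prognosis_healing", "process_actuators"),  # automatic tuning and correction
    ("process_diagnosis", "control_awareness"),  # major ML model changes
]


def inbound(node):
    """Numbers of the information arrows that end at `node`."""
    return sorted(n for n, (_, dst) in ARROWS.items() if dst == node)


def outbound(node):
    """Numbers of the information arrows that start at `node`."""
    return sorted(n for n, (src, _) in ARROWS.items() if src == node)


def intervention_sources(node):
    """Initiators of interventions applied at `node`."""
    return [src for src, dst in INTERVENTIONS if dst == node]


def route(src, dst):
    """Shortest chain of information arrows from src to dst (breadth-first)."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        for n in outbound(node):
            nxt = ARROWS[n][1]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [n]))
    return None


# Operator feedback reaches the diagnosis module via Arrow 9, then Arrow 5.
feedback_path = route("human_operator", "process_diagnosis")
```

Writing the flows down as data makes it easy to check that every module has the inputs its description requires, for example that the process-diagnosis module is reachable from the sensors, the operator, the IIoT, and the internet.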

Control and awareness module

Automated manufacturing processes must be closely monitored and controlled to ensure acceptable quality and to identify failures and problems. The established classical technique for monitoring and control is automated process control (APC), which we incorporate as the lowest level of the proposed framework. At the next level, we incorporate data granularity, which is required for the effective conversion of data into information. Different uses may dictate different granularity levels; for example, the "process state" may be recorded at a different granularity than individual process parameters. In addition, in the proposed framework, collecting information on process quality is part of this module's awareness and makes it possible to determine whether further action is required. We therefore extend monitoring and control beyond the elementary feedback loop by using statistical process control (SPC) techniques (enhanced SPC may be carried out by ML). Combining SPC, APC, and maintenance of a manufacturing system has many advantages (He et al. 2019; Lu et al. 2016; Park et al. 2012; Saif 2019). In our framework, the focus is on combining these methods to identify significant changes automatically and, upon identifying a change, making the updates necessary to reliably reflect recent dependencies in the data. These methods are implemented in the first proposed module, called "process control and awareness". Recently, integrating sensor fusion and an IoT approach has enabled the use of machine learning for better control and for predictive maintenance (Siddiqui et al. 2015). Moreover, the availability of smart sensors and systems has led to the advent of self-aware systems. However, when SPC and APC are integrated, the presence of autocorrelation, as well as specific patterns in the data, often makes it impossible to detect and classify an existing fault quickly and accurately, at least when classical SPC methods are employed (Psarakis 2011).
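One simple way to realize multiple granularity levels is to keep raw parameter samples alongside summaries over coarser, non-overlapping windows. The Python sketch below assumes hypothetical window sizes of 1, 5, and 25 samples and stores the mean, minimum, and maximum per window; an actual controller would choose its levels and statistics per use, as noted above.

```python
from statistics import mean


def aggregate(samples, window):
    """Summarize consecutive, non-overlapping windows of `window` samples.

    A trailing partial window is dropped.
    """
    out = []
    for i in range(0, len(samples) - window + 1, window):
        chunk = samples[i:i + window]
        out.append({"mean": mean(chunk), "min": min(chunk), "max": max(chunk)})
    return out


def build_granularity_store(samples, levels=(1, 5, 25)):
    # Hypothetical levels: raw parameter samples, short windows, and coarse
    # windows that might serve as "process state" records.
    return {w: aggregate(samples, w) for w in levels}


# Ten samples yield ten raw records, two 5-sample summaries, and no 25-sample one.
store = build_granularity_store(list(range(10)))
```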

In our framework, therefore, we propose implementing SPC using ML algorithms, which do not assume any predefined model and automatically learn the dependencies in the output observations that result from the APC (Amini and Chang 2018a; De Ketelaere et al. 2015; Rato et al. 2016; Shao and Hu 2020). Since the word "control" appears in both APC and SPC, we named the framework's first module the "control and awareness" module. The "awareness" part of the name refers to the module's ability to distinguish between different states of a process measure (e.g., fluctuating, deteriorating, increasing, abnormal).
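To illustrate why SPC must account for the dependencies introduced by APC, consider a residual-based control chart: a time-series model is fitted to in-control data, and the chart monitors one-step-ahead prediction residuals rather than the raw, autocorrelated observations. The Python sketch below uses a first-order autoregressive (AR(1)) model fitted by least squares purely as a simple stand-in for the ML models the framework calls for; the 3-sigma limit `k=3.0` and the simulated data are assumptions.

```python
import random
from statistics import mean, pstdev


class AR1ResidualChart:
    """Shewhart-type chart on one-step-ahead AR(1) residuals.

    A simple stand-in for an ML model that learns the autocorrelation
    left in the output of an APC loop.
    """

    def __init__(self, k=3.0):
        self.k = k  # control limits at +/- k residual standard deviations

    def fit(self, x):
        self.mu = mean(x)
        z = [v - self.mu for v in x]
        # Least-squares estimate of phi in z[t] = phi * z[t-1] + e[t].
        self.phi = sum(a * b for a, b in zip(z[1:], z[:-1])) / sum(b * b for b in z[:-1])
        resid = [a - self.phi * b for a, b in zip(z[1:], z[:-1])]
        self.sigma = pstdev(resid)
        return self

    def flags(self, x, prev):
        """Flag each new observation whose residual falls outside the limits."""
        out = []
        for v in x:
            r = (v - self.mu) - self.phi * (prev - self.mu)
            out.append(abs(r) > self.k * self.sigma)
            prev = v
        return out


# Simulated in-control data with strong autocorrelation (phi = 0.8).
rng = random.Random(42)
history, level = [], 0.0
for _ in range(500):
    level = 0.8 * level + rng.gauss(0.0, 1.0)
    history.append(level)
chart = AR1ResidualChart().fit(history)
```

Placing limits directly on autocorrelated observations can distort a classical chart's false-alarm behavior; charting residuals is one standard remedy, and a learned model can replace the AR(1) fit when the dependencies are nonlinear.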

Thus, the proposed control and awareness module is composed of the following four parts: (1) APC, (2) data granularity and its storage, (3) ML/SPC, and (4) controller awareness. These parts are depicted in Fig. 3.
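The controller-awareness part can be illustrated with a rule-based classifier that labels a window of a process measure with one of the states mentioned above. The Python sketch below is hypothetical throughout: the limits, the slope and spread thresholds, the extra "stable" label, and the assumption that a downward trend means deterioration (i.e., a measure for which higher is better) are ours, and an actual implementation could learn these states with ML instead of fixed rules.

```python
from statistics import mean, pstdev


def slope(window):
    """Least-squares slope of the window against the sample index."""
    n = len(window)
    ibar = (n - 1) / 2
    xbar = mean(window)
    num = sum((i - ibar) * (x - xbar) for i, x in enumerate(window))
    den = sum((i - ibar) ** 2 for i in range(n))
    return num / den


def awareness_state(window, low, high, max_sd=0.5, min_slope=0.05):
    """Label a window of a process measure (hypothetical rules and thresholds)."""
    if len(window) < 2:
        raise ValueError("need at least two samples")
    if any(x < low or x > high for x in window):
        return "abnormal"
    b = slope(window)
    # Trends are checked before spread, because a trend also inflates the spread.
    if b >= min_slope:
        return "increasing"
    if b <= -min_slope:
        return "deteriorating"  # assumes higher values are better
    if pstdev(window) > max_sd:
        return "fluctuating"
    return "stable"


label = awareness_state([10.0, 11.0, 9.0, 9.0, 11.0, 10.0], low=0.0, high=20.0)
```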

Fig. 3

The proposed control and awareness module and its main elements

Process-diagnosis module

The control and awareness module is designed to implement an effective compensation process using ML. However, in manufacturing systems, not all data that may affect the controlled target are usually available; furthermore, the autocorrelation and patterns within the monitored target can change dynamically, which can significantly reduce the anomaly-detection performance of machine-learning algorithms (Kholerdi et al. 2018). In other words, these models can perform accurately during training, or during a specific monitoring period, as long as the data or patterns do not change dramatically. In other testing periods, however, the test error or false-positive rate can increase due to over-fitting. Thus, a self-awareness ability is needed to identify when the models in use are no longer accurate, meaning that updated machine-learning models should be derived and implemented (Ollivier 2015; Srivastava et al. 2014).
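One way to give the controller this self-awareness is to run a change-detection test on the model's prediction errors and to request a model update when the errors drift upward. The Python sketch below uses the Page-Hinkley test, a standard sequential change detector that we substitute here for the unspecified mechanism; the tolerance `delta` and the alarm `threshold` are hypothetical values that would need tuning to the process.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # magnitude of change tolerated as noise
        self.threshold = threshold  # alarm level for the cumulative statistic
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0
        self.min_cum = 0.0

    def update(self, error):
        """Add one prediction error; return True when drift is signaled."""
        self.n += 1
        self.mean += (error - self.mean) / self.n  # running mean of the errors
        self.cum += error - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


detector = PageHinkley()
# Errors stay small while the model fits, then jump after a process change.
errors = [0.1] * 50 + [1.0] * 20
alarms = [i for i, e in enumerate(errors) if detector.update(e)]
```

Once an alarm is raised, the detector keeps signaling until it is reset; in the framework, that alarm would be the cue to derive and deploy an updated model.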

The process-diagnosis module plays several key roles in the functioning of the smart controller. Its first role is to uncover new factors and identify changes in existing factors. The module thus acts as an agent that scans the surroundings of the controller to identify environmental factors and tracks existing factors to detect changes when they occur. The continual feedback conveyed to the process-compensation ML model turns its learning into a dynamic, recurrent process (RML). Another diagnostic role is to perform root-cause analysis (RCA), which may require specialized AI procedures such as big-data search and case-based reasoning (CBR) carried out on a continual basis (RCBR). The process-diagnosis module of the smart controller is depicted in Fig. 4.
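Case-based reasoning is commonly described as a cycle of retrieving similar past cases, reusing and revising their solutions, and retaining the confirmed result as a new case. The Python sketch below shows only the retrieve and retain steps, using a nearest-neighbor search over numeric features; the feature vectors, root-cause labels, and Euclidean distance are hypothetical choices. Retaining every confirmed diagnosis is what lets the case base grow continually, in the spirit of RCBR.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Case:
    features: tuple   # e.g., (temperature deviation, vibration level); hypothetical
    root_cause: str


@dataclass
class CaseBase:
    cases: list = field(default_factory=list)

    def retrieve(self, features, k=1):
        """Return the k stored cases closest to the query (Euclidean distance)."""
        return sorted(self.cases, key=lambda c: math.dist(c.features, features))[:k]

    def retain(self, features, root_cause):
        """Store a confirmed diagnosis so future queries can reuse it."""
        self.cases.append(Case(tuple(features), root_cause))


cb = CaseBase([Case((5.0, 0.1), "heater drift"), Case((0.2, 2.0), "bearing wear")])
first = cb.retrieve((4.0, 0.3))[0].root_cause
cb.retain((0.1, 0.1), "sensor fault")  # expert-confirmed new case
second = cb.retrieve((0.0, 0.0))[0].root_cause
```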

Fig. 4

The proposed process-diagnosis module and its main elements

The process-diagnosis module receives all the data from the controller's process-awareness level, including the SPC data, the APC data, and the sensors’ self-awareness data. In addition, it receives data via IoT from other nearby sensors and controllers. These data are used for several important purposes: (1) root-cause analysis; (2) discovery of new factors; (3) discovery of new rules; and (4) process-state classification. Smart diagnosis requires internet access for big-data searches and analytics. Figure 5 depicts the flow of data and information generated by the process-diagnosis module.
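Discovery of new factors can start from a simple screen: a candidate signal that tracks the unexplained part of the process output (for example, the residuals of the compensation model) is worth a closer look. The Python sketch below scores hypothetical candidate signals by their Pearson correlation with such residuals; the threshold of 0.7 is an assumption, and because correlation does not establish causation, flagged candidates would still need verification, for instance by a human expert.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 when either series is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen_new_factors(residuals, candidates, threshold=0.7):
    """Return candidate signals whose |r| with the residuals reaches threshold."""
    scores = {name: pearson(series, residuals) for name, series in candidates.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}


residuals = [0.0, 1.0, 2.0, 3.0, 4.0]
candidates = {
    "ambient_temp": [20.0, 21.0, 22.0, 23.0, 24.0],  # hypothetical signals
    "humidity": [50.0, 49.0, 51.0, 50.0, 50.0],
}
flagged = screen_new_factors(residuals, candidates)
```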

Fig. 5

The flow of data and information in the proposed process-diagnosis module

The diagnosis stage is the most intensive stage in terms of data and information analysis. As depicted in Fig. 5, the main sources of data and information are: (1) historical process data; (2) current process data; (3) information related to neighboring processes and controllers, which passes through a gateway to ensure interoperability; (4) information from deep internet search queries about experience gathered in similar processes, with similar controllers, and in similar situations; and (5) the history of diagnosis findings and results. Because the unstructured data on the internet raise a host of compatibility and interoperability issues, the information in source (4) also passes through a gateway that must deal with these issues and ensure interoperability.

Internet use and big-data searches appear in Figs. 5 and 6 and deserve special attention in the age of digitization. Baum et al. (2018) give broad coverage of the use of big data and related technologies in maintenance, and Gao et al. (2016) provide more focused coverage of controllers. O’Donovan et al. (2015) describe a set of data and system requirements for implementing equipment maintenance applications in industrial environments, together with an information system model that uses a big-data pipeline to integrate, process, and analyze industrial equipment data. Bumblauskas et al. (2017) describe a decision-support system for maintenance based on corporate big-data analytics. Yu et al. (2019) describe a big-data ecosystem for fault detection in predictive maintenance. Finally, Zhang et al. (2020) achieve automatic anomaly detection using a big-data platform for intelligent maintenance. These references make clear that a special mechanism is needed to deal with the interoperability, streaming, and other issues that characterize the use of big-data searches. The natural place for such a mechanism in the proposed framework is the gateway of the External Interaction Module.
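At its simplest, the interoperability part of such a gateway maps heterogeneous external records onto the controller's own schema. The Python sketch below is hypothetical: the source names, field mappings, and the Fahrenheit-to-Celsius conversion are invented for illustration, and a production gateway would typically rely on an agreed information model (for example, one defined with OPC UA) rather than hand-written tables.

```python
# Hypothetical mappings for two external sources: field renames and unit
# conversions onto the controller's internal schema.
SCHEMAS = {
    "vendor_a": {
        "fields": {"temp_F": "temperature_c", "ts": "timestamp"},
        "convert": {"temp_F": lambda f: (f - 32) * 5 / 9},
    },
    "vendor_b": {
        "fields": {"temperature": "temperature_c", "time": "timestamp"},
        "convert": {},
    },
}


def normalize(source, record):
    """Translate one external record into the internal schema."""
    schema = SCHEMAS.get(source)
    if schema is None:
        raise ValueError(f"unknown source: {source}")
    out = {}
    for key, value in record.items():
        target = schema["fields"].get(key)
        if target is None:
            continue  # drop fields the controller does not understand
        convert = schema["convert"].get(key, lambda v: v)
        out[target] = convert(value)
    return out


a = normalize("vendor_a", {"temp_F": 212, "ts": 1, "extra": "ignored"})
b = normalize("vendor_b", {"temperature": 20.5, "time": 2})
```

Rejecting unknown sources outright, rather than passing their records through, keeps unvetted data out of the diagnosis module.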

Fig. 6

The intervention decision logic of the proposed prognosis and healing module

While Fig. 5 illustrates the flow of the data and the main purposes for which the data are used, it does not convey the broad range of techniques and methods that may be invoked at this stage. Thus, the proposed framework has been designed to allow flexibility in terms of its precise implementation.

Prognosis and healing module

This module performs three major tasks: (1) prognosis, (2) deciding how to proceed, and (3) managing automated healing (when automated healing is the chosen course of action). This module is the main user of the digital twin of the process controller. The digital twin provides not only an accurate snapshot of the current process situation, but also a simulation of the controller system's behavior under hypothetical interventions. The digital twin is a major pillar of cyber-physical systems (CPS), and its use is a hallmark of Industry 4.0 systems.

The major tasks of the prognosis and healing module are now described in greater detail.

  1. Prognosis: This task produces projections of the process and its parameters, and a digital twin may be used to compute them. The projections may be based on the current state and the trajectory of previous states, or on a simulation of future behavior. The module can use the digital twin to run "what-if" scenarios that estimate the effectiveness of candidate interventions. The prognosis task receives the results of predictive maintenance and uses them to predict future process trajectories. It should efficiently handle the complex logic of an automated process with a large volume of input data and should support decision-making at various levels.

  2. Deciding how to proceed requires choosing one of three alternatives based on the prognosis: (a) do nothing, (b) seek human interaction, or (c) perform automated healing. This task is described in Fig. 6.

     As depicted in Fig. 6, minor changes or questionable results lead to refraining from action. Changes that must be implemented manually, and equivocal or questionable candidate factors that need to be verified by an expert, lead to human interaction. Finally, if existing-factor values change, or if automated configuration changes are required, the intervention is handled automatically by the prognosis and healing module. Besides decisions based on the type of identified or required change, Fig. 6 shows that decisions are sometimes triggered by the identification of new rules or new factors for process maintenance and improvement.

  3. Automated healing: The controller must have full control over the processes involved in automated healing. Therefore, before any automated intervention, a prognosis of the process under that intervention should be computed. Healing can take various forms, for example a reset (or restart), a shutdown of the system (or of a sub-system only), the replacement of a workpiece, or the activation of a process such as cooling, heating, or lubricating.
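The three-way decision of Fig. 6 can be sketched as a small rule function. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PrognosisResult` fields, the `decide` function, and the thresholds `min_change` and `min_confidence` are hypothetical stand-ins for whatever the prognosis actually reports.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DO_NOTHING = "do nothing"
    HUMAN_INTERACTION = "seek human interaction"
    AUTOMATED_HEALING = "perform automated healing"


@dataclass
class PrognosisResult:
    # Hypothetical summary of a prognosis; field names are illustrative only.
    change_magnitude: float       # relative size of the projected change
    confidence: float             # confidence in the prognosis, between 0 and 1
    requires_manual_change: bool  # the change can only be implemented by a person
    new_candidate_factor: bool    # an equivocal factor that an expert must verify
    existing_factor_change: bool  # values of already-modeled factors must change
    needs_reconfiguration: bool   # an automated configuration change is required


def decide(p: PrognosisResult, min_change: float = 0.05,
           min_confidence: float = 0.6) -> Action:
    """Three-way intervention decision loosely following Fig. 6 (thresholds assumed)."""
    # Minor changes or questionable results: refrain from action.
    if p.change_magnitude < min_change or p.confidence < min_confidence:
        return Action.DO_NOTHING
    # Manual changes or candidate factors needing verification: involve a human.
    if p.requires_manual_change or p.new_candidate_factor:
        return Action.HUMAN_INTERACTION
    # Changes in existing-factor values or automated reconfiguration: heal automatically.
    if p.existing_factor_change or p.needs_reconfiguration:
        return Action.AUTOMATED_HEALING
    # Nothing actionable was identified.
    return Action.DO_NOTHING
```

The order of the checks encodes a design choice: filtering out minor or doubtful results first keeps the controller from acting, or from bothering the expert, on noise.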

The inputs to the prognosis and healing module are based on conclusions and information generated by the process-diagnosis module. This information consists of identified problems or problematic trends and their identified causes. The healing sub-module must choose the most appropriate method of intervention. In some cases this decision may be straightforward, but in others it may be a complex task. The healing process must be able to identify and compare alternative treatments, and it also needs to run a prognosis based on the selected treatment.
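Comparing alternative treatments through what-if prognoses can be sketched as follows, assuming a toy digital twin in which the controlled variable drifts linearly from step to step. The `simulate` model, the treatment names, and the mean-absolute-deviation cost are illustrative assumptions; a real digital twin would model the actual process behavior.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Treatment = Callable[[State], State]


def simulate(state: State, steps: int) -> List[float]:
    """Toy digital twin (hypothetical): the controlled value drifts each step."""
    s = dict(state)
    trajectory = []
    for _ in range(steps):
        s["value"] += s["drift"]
        trajectory.append(s["value"])
    return trajectory


def cost(trajectory: List[float], target: float) -> float:
    """Mean absolute deviation of a projected trajectory from the target."""
    return sum(abs(v - target) for v in trajectory) / len(trajectory)


def choose_treatment(state: State, treatments: Dict[str, Treatment], target: float,
                     steps: int = 20) -> Tuple[str, Dict[str, float]]:
    """Run a what-if prognosis for every alternative and return the cheapest one."""
    scores = {name: cost(simulate(treat(state), steps), target)
              for name, treat in treatments.items()}
    best = min(scores, key=scores.get)
    return best, scores


treatments: Dict[str, Treatment] = {
    "do nothing": lambda s: dict(s),
    "reset": lambda s: {**s, "value": 50.0},        # restore set point, drift remains
    "recalibrate": lambda s: {**s, "drift": 0.0},   # remove drift, offset remains
    "reset + recalibrate": lambda s: {**s, "value": 50.0, "drift": 0.0},
}

best, scores = choose_treatment({"value": 58.0, "drift": 0.4}, treatments, target=50.0)
```

With an initial value of 58 and a drift of 0.4 per step, only the combined reset and recalibration keeps the projected trajectory on the target of 50, so it is the treatment selected.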

External interaction platform

The External Interaction Platform is the platform from which all human contact is organized and implemented. This platform interacts not only with the human expert, but also with all the modules of the proposed framework. Figure 7 depicts the human interaction module and its main information exchanges.

Fig. 7

The main interactions between the External Interaction Platform and the other elements

Special attention should be paid to candidate factors that may be affecting the manufacturing process, also referred to as "undecided factors". Dealing with a candidate factor involves choosing among three alternatives: (1) adding the factor to the models as a new factor; (2) eliminating the factor from consideration; or (3) letting the factor remain undecided. Figure 7 shows these alternatives and their related information flows.

The following definitions are used in Fig. 8:

Fig. 8

Elicitation, classification and treatment of new factors by the human expert

W—Vector of verified new factors affecting the relevant process.

W'—Vector of candidate, undecided factors (i.e., their effect needs to be verified).
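The expert's treatment of candidate factors can be sketched as an update of the two vectors defined above. This is a minimal sketch under stated assumptions: the `Verdict` enumeration, the `classify_factors` function, and the factor names in the usage example are hypothetical, and candidates without a verdict are assumed to stay in W'.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Verdict(Enum):
    ADD = "add as new factor"     # expert confirms the factor's effect
    ELIMINATE = "eliminate"       # expert rejects the factor
    UNDECIDED = "keep undecided"  # more evidence is needed


def classify_factors(W: List[str], W_prime: List[str],
                     verdicts: Dict[str, Verdict]) -> Tuple[List[str], List[str], List[str]]:
    """Update W (verified factors) and W' (candidate factors) from expert verdicts.

    Returns (new W, new W', eliminated factors).
    """
    new_W = list(W)
    new_W_prime: List[str] = []
    eliminated: List[str] = []
    for factor in W_prime:
        verdict = verdicts.get(factor, Verdict.UNDECIDED)
        if verdict is Verdict.ADD:
            new_W.append(factor)        # becomes part of the process model
        elif verdict is Verdict.ELIMINATE:
            eliminated.append(factor)   # dropped from further consideration
        else:
            new_W_prime.append(factor)  # remains a candidate
    return new_W, new_W_prime, eliminated


W, W_prime, dropped = classify_factors(
    ["pressure", "temperature"],
    ["humidity", "fatigue", "shift"],
    {"humidity": Verdict.ADD, "shift": Verdict.ELIMINATE},
)
```

Here "humidity" moves into W, "shift" is eliminated, and "fatigue" stays in W' until the expert decides.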

Validation through an implementation showcase

This section describes an implementation of the control and awareness module and the process-diagnosis module in a smart controller deployed in a silicon wafer manufacturing process whose main aim is to maintain a desired thickness. The section serves as a proof of concept and validates the suggested methodological framework. The showcase demonstrates the feasibility of the proposed framework and also illustrates its real-world potential and effectiveness.

In the semiconductor industry, silicon wafers are widely used for fabricating integrated circuits. For the integrated circuits to perform properly, the wafers must have a specific thickness, so wafer thickness is usually monitored using scan data from a laser probe (Zhu et al. 2020). A common technique for achieving the desired thickness is fine double-sided polishing (Lee et al. 2009; Schwandner et al. 2014). Following Zhong et al. (2017a, b), the quality and uniformity of the obtained thickness across the entire wafer may be affected by several parameters of the polishing technique, such as: (i) the distribution of pressure, relative velocity and temperature; (ii) the flow of polishing fluid; and (iii) the properties of the polishing pad. These parameters are adjusted by a production worker under specific environmental conditions. Thus, the workers’ condition (e.g., fatigue, working hours, attention level), together with environmental conditions (e.g., temperature, humidity), may influence how closely the obtained values of the polishing parameters match their desired values (Golan et al. 2019; Strauch 2017).

Figure 9 presents the interfaces between the control and awareness module and the process-diagnosis module in a silicon wafer manufacturing process. In the control and awareness module, the process-specific target is the desired wafer thickness. We measure the thickness by scanning each produced wafer with a laser probe and monitor it using SPC charts based on machine-learning algorithms. The monitoring results are transferred to the automatic process control mechanism, which adjusts the values of the polishing-control parameters to correct the difference between the wafer’s thickness and the desired thickness. The monitored wafer thickness values, including indications of anomalies with respect to the thickness target, and the values of the polishing parameters are transferred to the process-diagnosis module. In this module, data on the workers’ and environmental conditions are collected, and root-cause analysis, drawing on case-based reasoning, is performed to identify the causes of significant differences between the wafer’s thickness and the desired thickness.

Figure 10 presents examples of monitoring charts from the control and awareness module, showing the wafer thickness (our target) and the velocity of the fine double-sided polishing, which is one of the automatically controlled parameters. The upper graph presents 25 wafer thickness values relative to the lower and upper control limits (25 µm and 75 µm, respectively); values outside these limits indicate a significant deviation from the desired thickness target of 50 µm. The lower graph shows the polishing velocity relative to four velocity intensity levels (low, medium, high, very high). The graphs show that the higher the velocity, the lower the thickness. Furthermore, toward the last thickness measurements in the graph, the velocity is low, the thickness is high, and an out-of-control indication is obtained. A drop of the velocity to low values that led to a consistent increase in thickness may indicate that the automatic process control was not effective. These data were transferred to the process-diagnosis module in order to identify, among the parameters related to the workers’ and environmental conditions, the reasons for the drop in polishing velocity. Root-cause analysis was performed on historical data, using machine-learning algorithms, to find the reasons and changes that led to the drop in polishing velocity and, in turn, to high-thickness wafers. In our showcase, we use an ordinal decision tree algorithm (Singer and Marudi 2020; Singer et al. 2020) to perform root-cause analysis, identifying which parameters may influence the level of velocity intensity (i.e., low, medium, high, or very high velocity). This ordinal machine-learning algorithm suits the process-diagnosis module in our example for two main reasons. First, the algorithm takes into account the deviations of the thickness from its target when evaluating the desired changes in the polishing velocity level. Second, the ordinal decision tree produces an interpretable model, which yields practical insights into the relationship between the wafer-polishing velocity level and the wafer’s defects (Singer and Cohen 2020).
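The out-of-control logic behind charts like those in Fig. 10 can be sketched as follows, using the limits reported for the showcase (25 µm and 75 µm around a 50 µm target). The run rule, which flags several consecutive points on the same side of the center line, is an added assumption in the spirit of the Western Electric rules (its default length of 8 is also assumed), and the thickness series is made up for illustration. The showcase itself uses SPC charts based on machine-learning algorithms, which this simple sketch does not reproduce.

```python
from typing import List, Sequence

LCL, CENTER, UCL = 25.0, 50.0, 75.0  # µm: limits and target reported for Fig. 10


def out_of_limits(values: Sequence[float]) -> List[int]:
    """Indices of points outside the lower/upper control limits."""
    return [i for i, v in enumerate(values) if v < LCL or v > UCL]


def same_side_runs(values: Sequence[float], run_length: int = 8) -> List[int]:
    """Indices at which `run_length` or more consecutive points lie on one side of CENTER."""
    alarms: List[int] = []
    run, side = 0, 0
    for i, v in enumerate(values):
        s = (v > CENTER) - (v < CENTER)  # +1 above, -1 below, 0 exactly on the line
        if s != 0 and s == side:
            run += 1
        else:
            run, side = (1 if s != 0 else 0), s
        if run >= run_length:
            alarms.append(i)
    return alarms


# Made-up thickness series (µm) with an upward trend at the end.
thickness = [49, 52, 47, 51, 50, 48, 53, 55, 58, 60, 63, 66, 70, 74, 78]
limit_alarms = out_of_limits(thickness)
run_alarms = same_side_runs(thickness)
```

For this series the run rule raises an alarm at index 13, one point before the value of 78 µm crosses the upper limit, which illustrates how a consistent upward trend can be caught before a limit violation.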

Figure 11 is an example of a resulting ordinal decision tree. It shows that the polishing velocity is low 10% of the time, medium 30% of the time, high 50% of the time, and very high 10% of the time; this velocity distribution is written as (0.1, 0.3, 0.5, 0.1). However, when humidity is high and worker fatigue is high, the probability of low velocity rises to 50%. These insights are transferred via the External Interaction Platform to the production line managers, so that they can develop working mechanisms for monitoring and controlling polishing velocity under conditions of high humidity and high worker fatigue.
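To show what distinguishes an ordinal tree from an ordinary one, the sketch below computes node class distributions over the four ordered velocity levels and scores candidate splits with an impurity based on the cumulative distribution, sum_k F_k (1 - F_k). This criterion is a generic ordinal impurity chosen for illustration; it is not necessarily the splitting criterion of the ordinal decision tree of Singer and Marudi (2020), and the function names and labels are hypothetical.

```python
from collections import Counter
from typing import List, Sequence

LEVELS = ["low", "medium", "high", "very high"]  # ordered velocity classes


def distribution(labels: Sequence[str]) -> List[float]:
    """Class proportions listed in the fixed ordinal order of LEVELS."""
    counts = Counter(labels)
    n = len(labels)
    return [counts[level] / n for level in LEVELS]


def ordinal_gini(dist: Sequence[float]) -> float:
    """Impurity from the cumulative distribution: sum_k F_k * (1 - F_k).

    Unlike the ordinary Gini index, mixing 'low' with 'very high'
    is penalized more than mixing neighboring levels.
    """
    impurity, cum = 0.0, 0.0
    for p in dist[:-1]:  # the last cumulative value is 1 and contributes nothing
        cum += p
        impurity += cum * (1.0 - cum)
    return impurity


def split_gain(parent: Sequence[str], left: Sequence[str], right: Sequence[str]) -> float:
    """Reduction in weighted ordinal impurity achieved by a candidate split."""
    n = len(parent)
    child = (len(left) / n) * ordinal_gini(distribution(left)) \
        + (len(right) / n) * ordinal_gini(distribution(right))
    return ordinal_gini(distribution(parent)) - child
```

For the root distribution (0.1, 0.3, 0.5, 0.1) reported for Fig. 11, `ordinal_gini` gives 0.09 + 0.24 + 0.09 = 0.42; a split such as high humidity with high fatigue would be kept if it reduces this value the most.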

Fig. 9

Control and awareness and process-diagnosis modules of the smart controller in a silicon wafer manufacturing process with a desired thickness

Fig. 10

Examples of monitoring charts for wafer thickness and polishing velocity from the control and awareness module of a silicon wafer manufacturing process

Fig. 11

Ordinal classification tree for root-cause analysis of polishing velocity (low velocity; medium velocity; high velocity; very high velocity) from the process-diagnosis module

Discussion on implementation and validation

In Sect. 4, we demonstrated the applicability of the control and awareness and process-diagnosis modules of the proposed methodological framework, deployed in a silicon wafer manufacturing process whose main aim is to maintain a desired thickness. The showcase emphasizes the importance of the interfaces between the modules. We illustrated the use of control charts for monitoring the wafer’s thickness and the parameters that may directly affect it. We then presented an example of an ordinal decision tree algorithm used as a root-cause analysis tool in the diagnosis module, based on the outputs obtained from the control charts and on historical data regarding production workers and environmental conditions. The insights from the diagnosis module are transferred to the production line managers to help them derive practical decisions that prevent future wafer defects of this type in the silicon wafer manufacturing process.

Since the framework deals with a single controller of a single process, it is strongly recommended that the framework be implemented as one unified project, using the same programming language, the same data-format scheme, and the same team of developers. The framework could therefore be treated as an embedded software package on the shop floor. In addition, the framework's internal modules are meant to operate only within the proposed framework, and it may be hazardous to try to replace internal modules with existing legacy modules.

With the exception of the interaction with local sensors and actuators and with the human supervisor, the framework deploys a smart gateway that enables it to communicate and cooperate with the legacy shop floor systems. The advantage of implementing the framework as a unified system is that the controller's software can be embedded in larger systems. As a result, once the controller's software exists for a given process, only a small number of adjustments may be needed to replicate it in similar processes.

Interoperability is of great importance to the implementation of any manufacturing control system, including process controllers. Noura et al. (2019) present several types of interoperability:

  (1) Device interoperability is concerned with (i) the exchange of information between heterogeneous devices and heterogeneous communication protocols and (ii) the ability to integrate new devices into any IoT platform.

  (2) Network interoperability deals with mechanisms that enable seamless message exchange between systems across different networks (networks of networks) for end-to-end communication.

  (3) Syntactic interoperability refers to the interoperation of the format and the data structure of any exchanged information. Syntactic interoperability problems arise when the sender’s encoding rules are incompatible with the receiver’s decoding rules.

  (4) Semantic interoperability concerns meaning: even if two different systems hold the same data, differences between their data models and information models lead to different descriptions or understandings.

  (5) Platform interoperability concerns the problems that arise in communication between different operating systems (OSs), programming languages, data structures, architectures, and access mechanisms for things and data (Noura et al. 2019).

Since this paper focuses on the framework of a single process controller, its implementation must be an integral part of the process-related operating system. Platform interoperability should therefore be dealt with at a higher system level and is beyond the scope of this paper. The controller system is, in large part, a real-time or near-real-time system, which implies that the proposed framework should be implemented as a single unified system operating on the same network. Moreover, a process controller is typically part of the local manufacturing network, so network interoperability should be dealt with at the network level and is also beyond the scope of this paper. We assume that the proposed framework is implemented as one integral project, with a single programming language and a unified treatment of data and information formats, meaning, and manipulation. This is not only reasonable but also expected for single-controller software. Thus, within the framework, only device interoperability may be an issue. Data arriving from devices such as sensors and actuators are therefore pre-processed in the control and awareness module to ensure format compatibility and correct treatment of the collected data.

Communication with external entities (outside the implemented framework) may indeed be subject to syntactic and semantic interoperability issues, and a number of solutions are found in the literature (Jardim-Goncalves et al. 2016). For example, Khilwani and Harding (2016) focus on semantic web concepts and tools that enable computers to automatically process information expressed in a form that both machines and humans can understand. Delaram and Valilai (2017) propose a solution for achieving manufacturing interoperability by means of interoperability service providers. Kamiński (2020) proposes an integration platform based on common standards and interoperability technologies (such as SOA, XML, and Web Services). Several papers suggest treating interoperability by means of a gateway (Aloi et al. 2016; Vargas and Salvador 2016; Adesina et al. 2019; Jiang et al. 2020), and this is the approach we adopt in the suggested framework. The software gateway should be able to convert between the most prevalent IoT standards (e.g., OPC UA (Open Platform Communications Unified Architecture) and MQTT (Message Queuing Telemetry Transport)) and should resolve issues of device, syntactic, and semantic interoperability. The proposed gateway is shown in Figs. 2 and 3.
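The syntactic and semantic translation such a gateway performs can be sketched as follows, assuming two hypothetical external sources: an OPC UA-style reading that arrives as typed values and an MQTT-style JSON payload. The sketch uses only the Python standard library and deliberately does not call any OPC UA or MQTT client library; the tag names, the payload layout (`v`, `unit`, `ts`), and the unit table are invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical semantic mapping from external tag names to internal variable names.
NAME_MAP: Dict[str, str] = {
    "Wafer.Thickness": "thickness_um",          # OPC UA-style node name (invented)
    "line3/polisher/velocity": "velocity_rpm",  # MQTT-style topic (invented)
}

# Hypothetical conversions from (internal variable, external unit) to internal units.
UNIT_CONVERSIONS: Dict[Tuple[str, str], Callable[[float], float]] = {
    ("thickness_um", "um"): lambda v: v,
    ("thickness_um", "mm"): lambda v: v * 1000.0,
    ("velocity_rpm", "rpm"): lambda v: v,
}


@dataclass(frozen=True)
class Reading:
    """Internal, format-independent representation used inside the controller."""
    name: str
    value: float
    timestamp: float


def normalize(external_name: str, value: float, unit: str, timestamp: float) -> Reading:
    """Semantic step: map the external name and unit onto the internal model."""
    try:
        name = NAME_MAP[external_name]
        convert = UNIT_CONVERSIONS[(name, unit)]
    except KeyError as exc:
        raise ValueError(f"no mapping for {external_name!r} in unit {unit!r}") from exc
    return Reading(name, convert(float(value)), float(timestamp))


def from_structured(node: str, value: float, unit: str, timestamp: float) -> Reading:
    """Adapter for readings that already arrive as typed values (OPC UA-style)."""
    return normalize(node, value, unit, timestamp)


def from_json_payload(topic: str, payload: bytes) -> Reading:
    """Syntactic step: decode an assumed JSON layout {"v": ..., "unit": ..., "ts": ...}."""
    data = json.loads(payload.decode("utf-8"))
    return normalize(topic, data["v"], data["unit"], data["ts"])
```

Rejecting unmapped tags with an explicit error, rather than passing them through, keeps unknown data from entering the controller's unified internal model.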

The proposed control and awareness module is supported by several research studies that focused on the development of an online monitoring and control (OMC) system to detect and isolate defects and to manufacture parts with particular desired qualities (Amini and Chang 2018b; Imani et al. 2018; Yao et al. 2018). A particularly important aspect of the proposed framework is that it can accommodate any combination of ML and SPC techniques for fault detection. This is in line with recent research studies that suggest using machine learning algorithms for monitoring instead of traditional statistical process control techniques (Bacher and Ben-Gal 2017; Chou et al. 2020; Schuster et al. 2018). There are two main reasons for using such data-driven approaches (Amini and Chang 2018a):

  1. They do not assume a particular data distribution in advance, and they can capture the patterns and dependencies present in the data.

  2. They can be useful for monitoring manufacturing processes with a high number of dimensions. Such processes are becoming increasingly common due to the proliferation of smart sensors and data streaming, which generate enormous quantities of data daily.
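As one example of a distribution-free, multivariate monitoring approach, the sketch below scores each observation by its mean distance to the k nearest in-control reference observations and sets the control limit at an empirical quantile of leave-one-out reference scores, so no distributional form is assumed. The k-nearest-neighbor method, `k=3`, the 0.99 quantile, and the reference grid are illustrative choices, not techniques prescribed by the framework or by the cited studies.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def knn_score(x: Point, reference: List[Point], k: int = 3) -> float:
    """Mean Euclidean distance from x to its k nearest reference points."""
    dists = sorted(math.dist(x, r) for r in reference)
    return sum(dists[:k]) / k


def fit_threshold(reference: List[Point], k: int = 3, quantile: float = 0.99) -> float:
    """Empirical control limit: a quantile of leave-one-out scores of in-control data."""
    scores = sorted(
        knn_score(x, reference[:i] + reference[i + 1:], k)
        for i, x in enumerate(reference)
    )
    idx = min(len(scores) - 1, max(0, math.ceil(quantile * len(scores)) - 1))
    return scores[idx]


def is_anomalous(x: Point, reference: List[Point], threshold: float, k: int = 3) -> bool:
    """Signal when an observation is farther from in-control data than the limit."""
    return knn_score(x, reference, k) > threshold


# Made-up in-control reference data: a small grid in a 3-dimensional feature space.
reference = [(i * 0.1, j * 0.1, 0.0) for i in range(5) for j in range(5)]
threshold = fit_threshold(reference)
```

Because the limit comes from the empirical score distribution, the same code applies unchanged as the number of monitored dimensions grows, although the reference set must be large enough to represent in-control behavior.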

Although the proposed framework incorporates only a single controller, it shares many general similarities with other proposed frameworks. Table 2 compares the proposed framework to some of the other frameworks in terms of their organization (how the functionality is arranged).

The aim of this paper is to provide a generic framework suitable for a wide range of current and future ML techniques. Consequently, it does not, and should not, compete with any specific ML technique. For the various ML techniques used for fault detection and predictive maintenance, see Carvalho et al. (2019), Lo et al. (2019) and Angelopoulos et al. (2020). However, the framework advocates and supports the recurrent machine learning capabilities suggested here, which include dynamically changing the ML model itself (not just its weights) and incorporating new factors into the model. This approach is relatively new at the time of writing, but its popularity is expected to grow significantly.

The scheme of process monitoring and control has been extended in some manufacturing industries to include closed-loop control for defect prevention. For example, in additive manufacturing, several research studies (Chua et al. 2017; Garanger et al. 2018; Mazumder 2015) have introduced closed-loop feedback control, so that when a defect is detected (from sensor signatures and process parameters), physical properties such as laser power or scan speed can be changed to rectify the defect. The proposed framework integrates the approaches of different research groups into one unified system (shown in Fig. 2), thus leveraging the promising results achieved by each group. Consider, for example, the integration proposed by Garanger et al. (2018) of a feedback control mechanism into the additive manufacturing of a plastic object made of several parts, each with a different infill density (the infill densities are the control variables). The infill density of each part is adjusted in a closed-loop control process to achieve the desired stiffness. This is a simplistic example, however; in practice, when additive manufacturing involves a large number of control variables, it can be difficult to understand the relationships between these variables and the final properties of the part (Mani et al. 2017). In complicated cases such as these, the framework proposed in the present study would search for other variables to monitor. Such a search should yield control variables in addition to infill density, for example laser power and scan speed, that would make it possible to retain all the target properties of the plastic objects.
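The closed-loop adjustment of infill density toward a target stiffness can be sketched as a part-to-part (run-to-run) integral controller. Everything numeric here is hypothetical: the linear plant `stiffness = 200 * density`, the gain, the density bounds, and the target of 120 (arbitrary units) are assumptions for illustration and are not the control law of Garanger et al. (2018).

```python
from typing import Callable, List, Tuple


def run_to_run_control(
    measure_stiffness: Callable[[float], float],
    target: float,
    density: float = 0.3,
    gain: float = 0.5,
    n_parts: int = 20,
    bounds: Tuple[float, float] = (0.05, 1.0),
) -> List[float]:
    """Adjust infill density part by part so that measured stiffness approaches target.

    Integral-type update on the relative error, clamped to the feasible density range.
    Returns the stiffness measured for each printed part.
    """
    lo, hi = bounds
    history: List[float] = []
    for _ in range(n_parts):
        stiffness = measure_stiffness(density)  # print and measure one part
        history.append(stiffness)
        error = (target - stiffness) / target   # relative stiffness error
        density = min(hi, max(lo, density + gain * error))
    return history


def plant(density: float) -> float:
    """Hypothetical process model: stiffness proportional to infill density."""
    return 200.0 * density


history = run_to_run_control(plant, target=120.0)
```

With these numbers the controller settles at a density of about 0.6, where the plant delivers the target stiffness; a much larger gain would make the updates overshoot and oscillate. When several interacting control variables are involved, as the text notes, such a single-variable loop is no longer sufficient.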

Conclusions

This paper presents a new holistic framework for a smart machine controller that aims to increase quality and reduce downtime. The paper describes in detail the proposed smart machine controller logic and its four main software modules, along with the interactions between them: (1) the Control and awareness module, which performs continuous APC and ML for compensation/correction, as well as SPC for monitoring and for invoking the process-diagnosis module; (2) the Controller process-diagnosis module, which performs continual (recurrent) analysis of the process state and trends, detecting new factors and tracking changes in existing ones; (3) the Prognosis and healing module, which performs prognosis and accordingly decides on one of three alternatives: (i) do nothing, (ii) inform a human operator, or (iii) intervene, in which case automated intervention is performed via parameter changes, re-configurations, and automated maintenance; and (4) the External Interaction Platform, an interactive module for interfacing with operators and experts that presents them with the process analysis information and obtains their feedback as part of a learning process.

Sections 4 and 5 support the premise, stated in the introduction, that such a framework is feasible. Section 5 discusses the compatibility of the proposed framework with Industry 4.0 and CPS concepts such as self-awareness, self-diagnosis, self-prognosis, and self-healing. We also show that the framework incorporates Industry 4.0 practices, such as the use of machine learning (ML) and case-based reasoning (CBR), internet of things (IoT) communications, and predictive maintenance. The six components assumed to be the main building blocks of the framework are incorporated into the four modules as follows:

  (I) The Control & Awareness Module integrates components (1) automatic process control (APC) and (2) statistical process control (SPC).

  (II) The Process-Diagnosis Module utilizes (3) recurrent machine learning (RML) and (4) smart process diagnosis.

  (III) The Prognosis & Healing Module is composed of component (5) smart process prognosis for predictive maintenance and intervention.

  (IV) The External Interaction Platform manages the interaction with component (6): humans, the manufacturing system, and external IoT.

With the exception of the interaction with local sensors and actuators and with the human supervisor, the framework deploys a smart gateway that enables it to communicate with other information systems and cooperate with other legacy systems.

The proposed framework will allow the operators of manufacturing equipment to detect operational problems before a serious situation has time to develop. They can then take corrective action to restore the process to its proper state, and thereby adhere to the recommended guidelines for preventive maintenance. The framework can serve as a valuable reference for those wishing to implement Industry 4.0 smart control on shop floors. Future research could pursue improvements to the current methodological framework, compare it with frameworks suggested in the future, and validate it through case studies that adopt either the full framework or parts of it.