1 Introduction

Environmental sustainability has become an important concern for manufacturing companies [1,2,3], and it calls for high availability, quality and utilisation of manufacturers’ products [4]. Natural resource consumption is also an increasingly important issue, which has led to strong interest in circular economy (CE) strategies [5, 6]. To decrease the resource consumption of our societies, sharing manufacturing resources among multiple users, combined with product-service system (PSS) contracts offered by industrial firms, is a promising approach [7, 8]. Within the PSS paradigm, a growing number of industrial practitioners use the sharing business to increase the utilisation of their products [9, 10]. For instance, car sharing is a well-known conventional example of a PSS: groups sharing cars need fewer cars than individual ownership would require, while their mobility needs are still met [11]. Meanwhile, the specialised maintenance and fault diagnosis services of PSS providers can enhance the availability and reliability of products, thereby benefiting customers.

Currently, connectivity through the Internet of things (IoT) among products, manufacturers and users makes the sharing business more convenient and more valuable, as connectivity is a fundamental part of any sharing economy [12]. Moreover, the sharing business also has potential in other sectors, such as consumer electronics offered as a service with IoT techniques [13]. The sharing business of the PSS is being adopted by an increasing number of manufacturers, which creates opportunities for them to better control and manage their products.

At the same time, manufacturers’ roles in our societies are also changing, influenced by the growth of the sharing economy. The movement of Fabrication Laboratories, small workshops that offer tools and services for digital manufacturing to a large number of users, has shown remarkable development [14]. From the environmental sustainability perspective, such manufacturing processes are more accessible to public users and are gaining increased interest in the industrial community [15, 16]. In this context, sharing industrial facilities for manufacturing is also attracting attention in research and practice. For instance, crowdsourced manufacturing, where companies share their manufacturing facilities depending on their demands and capacity, has been evaluated by industry [17]. Furthermore, agent-based control combined with a matching algorithm has been proposed to increase the utilisation ratio of manufacturing facilities [18]. Of particular interest for this article, the sharing business can in general be facilitated by big data analytics (BDA) [19]. Therefore, sharing industrial facilities for manufacturing in combination with BDA has evident potential. However, research on this combination is still in its infancy, as reviewed above. Two questions need to be investigated and answered: (1) how can PSS providers make full use of the multi-source operation and maintenance data of products in different conditions to improve their capability to manage and control those products; and (2) how can a procedural approach to real-time data analysis be implemented to evaluate the health status of products and identify their faults pre-emptively?

2 Related Work and Knowledge Gap

Recent investigations show that an appropriate operation mode, accessible product data and efficient data analytics interact strongly and jointly determine whether manufacturers achieve the desired product availability, utilisation rate, production planning and product quality [4, 20,21,22,23].

However, with the digitisation of industry and the advancement of multisensory technologies, PSS providers face many challenges. One major challenge is how to accurately and comprehensively capture the operation and maintenance big data of products during the whole PSS delivery process (especially the operational data of different products and different customers in different conditions). Such data give PSS providers and customers a complete and reliable basis for adjusting and optimising their decisions dynamically (e.g. fault diagnosis, prediction of maintenance measures, production planning, and planning of sharing and leasing strategies). To address this challenge, some PSS providers have begun to acquire and store big data using cutting-edge information and communication technologies (ICT) [24, 25]. For instance, big data about product operation, including the operating environment in the use phase, can be measured through various sensors [26,27,28,29]. The acquired data can be analysed using existing data analytics techniques to derive broader and more useful information on health monitoring and fault diagnosis for PSS providers [30]. Some frameworks focused on data acquisition and management within PSS have also been proposed to realise these processes [26, 31, 32]. However, most existing studies have focused primarily on operation and maintenance decision-making based on the operational data of ‘smaller cluster products’. Here, the term ‘smaller cluster products’ refers to a single category of products whose users are also relatively uniform and scattered. This situation is very common in existing product fault diagnosis methods and PSS-based operation modes. As a result, the real potential of big data has not been demonstrated or assessed, which has led to inefficient fault diagnosis and wasted resources during maintenance. Therefore, a new PSS-based operation mode is needed to collect the operation status data of different products and multiple users in different conditions.

As pointed out by Kusiak [23], big data is a long way from transforming manufacturing, because most manufacturers and customers do not know what to do with the big data they have. Hence, applying advanced analytics techniques to carry out efficient BDA is a vital task for manufacturers that want to transform their manufacturing mode, and it determines their competitiveness [33, 34]. For example, agent-based systems (ABS) and artificial neural networks (ANN) have been investigated to optimise the design scheme of new products [35, 36]. Based on the industrial Internet of things (IIoT) and manufacturing big data, a cyber-physical energy system was developed to improve the energy efficiency of the dyeing process [37]. The authors reported that the system can optimise the dyeing process using machine learning techniques and manufacturing big data by adjusting cyber-physical energy systems, without expensive equipment. The genetic algorithm (GA) and fuzzy logic theory have been used to optimise shop floor scheduling [38, 39] and to control product quality [40, 41] in a manufacturing big data environment. Machine learning-based techniques have been used to predict the material removal rate in polishing [42], to identify faults of motor bearings [30], and to improve maintenance in semiconductor device manufacturing [43]. Processes enabled by these methods and techniques can revolutionise design, operation and maintenance by making them more intelligent. However, major challenges remain in applying data-driven decision-making to capture the economic and business value of big data, such as how to design a solution that analyses operational data in a timely manner and identifies faults early for better management of products. Therefore, there is an urgent demand for an efficient and procedural approach that analyses the data from PSS delivery processes and supports informed decisions in the operation and maintenance processes. Insights into this challenge are missing, partly due to the lack of empirical research within industrial settings.

Recently, the opportunities arising from combining big data with a PSS strategy were explored by Opresnik and Taisch [44]. The authors pointed out that the value of big data depends on the adopted business model, including its operation mode and its way of capturing value. This means that the challenge lies in how BDA brings value through an appropriate business model. Meanwhile, different business models have been researched in the production community to underpin PSS providers’ competitive advantage. Meier et al. [7] and Annarelli et al. [45] provided a systematic overview of PSS, covering different methods, tactics, benefits and barriers, and in part also the drivers for them, while Roy et al. [46] focused on key technologies applied to maintenance services: self-repair technologies, digital maintenance repair overhaul (MRO), big data and visualisation of maintenance tasks. A lean PSS and its application in the mould making industry were investigated by Mourtzis et al. [47], who recommended a set of key performance indicators (e.g. design, manufacturing, customer, environment, and sustainability) for evaluating and improving lean PSS design. Gao et al. [10] investigated a service-oriented manufacturing paradigm characterised by business model, industry insight and technology strength. The authors documented a PSS practice in the turbomachinery industry in which the provider accesses data of equipment in use by installing sensors. Lay et al. [48] reported on the practice of PSS in industry and proposed several categories of operation modes, among them a mode in which the provider not only retains ownership of the machines but also operates and maintains them for multiple customers. Overall, PSS may provide better conditions for accessing the desired data and thus for obtaining higher value from data. However, virtually no literature provides knowledge on exploiting PSS to bring out value from big data.

As reviewed above, the interaction of accessible product data, efficient data analytics and an appropriate business model has recently attracted much attention in the literature. Current research mostly either combines available product data with BDA [49, 50] or aligns PSS with BDA [44, 51]. Studies that address all three aspects together are rare. This means that a knowledge gap exists in how available product data and BDA techniques bring value through an appropriate business model.

To address the foregoing challenges and fill in the gaps of scientific knowledge, this paper proposes a PSS-based and advanced operation mode that benefits from accessible product lifecycle big data and BDA technology. The leasing and sharing manner is important for the innovation of the proposed operation mode and is different from the existing ones. For the advanced operation mode, all shared products (e.g. processing devices or production machines) are exclusively controlled and managed by the PSS providers in the premises they provide and in a centralised manner. This manner can help industrial practitioners centralise the scattered orders to make full use of the processing devices and reduce resource and energy consumption. Besides, in the advanced operation mode, the customers do not need to purchase products and build plants, let alone ship products to their factories. The customers only need to pay the PSS providers by the usage time or by the processing quantity.

The rest of this paper is organised as follows: Sect. 3 describes the proposed PSS-based operation mode. Section 4 presents a procedural approach for fault diagnosis involving BDA based on the proposed operation mode. Section 5 demonstrates the application of this new mode to a real case of a leading CNC machine provider. In Sect. 6, the authors discuss the practical advantages of the new mode. Section 7 highlights the contributions and the future works of the paper.

3 Overview of the Proposed Operation Mode

The volume of data for operation and maintenance decision-making increases significantly with the wide application of sensors and wireless technologies in PSS business solutions. Most PSS providers intend to interpret these big data to improve their processes, services and products. Although the future of big data-enabled business strategy is promising, the implementation of comprehensive data access and efficient BDA in PSS is in its infancy. Within existing PSS, an appropriate operation mode that includes both full data access and a useful approach to knowledge capture is rarely available. Building on existing PSS, this paper presents a new, high-level operation mode, the so-called production-side sharing (also known as shared-use production machines), in which the production machines are under the exclusive control of the PSS providers.

For better understanding, the proposed operation mode is described by six characteristics and compared with existing modes reported in the literature, as shown in Table 1. The product is owned by the provider, located at the provider’s premises, and operated by the provider or by multiple customers. Furthermore, the operation and maintenance process data are collected and accessed by the provider. This contrasts with the traditional product sales mode in the rightmost column, which still applies in many cases today. In the proposed operation mode, multiple customers use a product at a predefined location of the PSS provider, while in the traditional mode each customer uses its own product at its own location. This means that multiple customers share a product (i.e., rent or share it under a time-bound contract or by processing quantity) in the proposed mode. According to the order quantities of customers and the production capacity of machines, the provider can formulate and adjust the lease contract dynamically. In this way, the machine idle time caused by a decline in orders under the traditional operation mode could be avoided, thereby enhancing the utilisation rate of the production machines.

Table 1 The proposed mode as compared with existing ones in literature

Table 1 also shows two other modes relatively advanced from the traditional one. However, the product sharing mode [48] is different from the proposed mode in terms of who installs sensors and who accesses the operation data. The other one [10] is a mode where the provider installs sensors and may access the operation data to use the acquired data effectively. It means that the proposed mode can be regarded as a hybrid of these two modes in the literature [48] and [10].

According to the above analyses, the correlation and difference between the proposed operation mode and the normal PSS can be identified. It is worth noting that the advanced operation mode proposed in this paper is benefiting from the combination of the emerging ICT techniques with the PSS paradigm. Therefore, it has the same characteristics as the normal PSS and today’s digital servitisation business paradigm, such as Smart PSS/Cyber-Physical PSS [52, 53]. For example, the ownership of products is retained by the PSS providers, and the customers can lease, share and use products by a time-bound contract. It can facilitate sustainable production and consumption, decrease hazardous materials that end up in landfills, and help the PSS providers make more profits, and so on.

However, the leasing/sharing manner for the advanced operation mode is different from the existing PSS. In the proposed operation mode, all products can form a resource pool, and then they are leased and shared by various customers with an integrated service contract. As mentioned previously, all leased/shared products are completely managed by the PSS providers in the premises they provide and in a centralised manner. Therefore, the operational data of different products and diverse users in widely varied conditions can be collected more completely and accurately. In other words, the data from ‘large cluster products’ can be collected to obtain more training samples and to improve the fault diagnosis and fault prediction performance. Here, the term ‘large cluster products’ is named relative to the traditional PSS-based sharing mode. This is also an important characteristic of the advanced operation mode proposed in this paper, and it is also a special application scenario under this mode. Moreover, customers do not need to purchase products and build plants, decreasing the customers’ operation costs and investment risks. Meanwhile, the real-time and multi-source operations status data of different products and different users in diverse conditions can be collected more easily and comprehensively. This is very difficult (or only partially possible) for the traditional PSS leasing/sharing mode, since different products are often distributed in different factories of diverse regions and operated by different customers.

A major advantage of the proposed mode is the provider’s access to data that are effective for various purposes (e.g. monitoring the health status and performing intelligent fault diagnosis). This is facilitated by the provider’s unique knowledge and rich experience of what is worth measuring, as well as its freedom in how sensors are installed. In addition, timely and full access to the data (i.e., no loss of measured data) is an advantage that follows from the provider’s control of the product’s operation. Another advantage is the provider’s faster physical access to the products, because they are located at the PSS provider’s premises. Together, these advantages give the provider the potential to deliver services to customers effectively and promptly.

Besides that, the proposed operation mode has the following advantages: (1) sharing professional technicians with other customers and guaranteeing the processing quality of products, especially technically sophisticated ones; (2) sharing production orders with other customers, so that each independent customer can finish orders on time, shorten product delivery cycles and share value with all stakeholders; (3) reducing customers’ investment in personnel, equipment, workplace, etc., especially for small and medium-sized enterprises (SMEs); and (4) reducing environmental impact by increasing product utilisation and decreasing the quantity of products in circulation.

The proposed operation mode extends beyond the mere sharing and leasing of the products as included in the traditional PSS. Through the IIoT and cyber-physical system (CPS) technologies, all products at the provider’s predefined location are able to sense and interact and reason, and to enable the ubiquitous connectivity and dynamic synchronisation. Therefore, the sharing of production, resource, staff, technology and service can be achieved in the proposed operation mode. As a result, the advanced operation mode may change the traditional manner of management, control and maintenance of the products. It can also be adopted to promote the transformation and upgrading of the traditional manufacturing industry, integrate existing resources, and maximise the value of all stakeholders.

4 A Procedural Approach for Fault Diagnosis Based on the Proposed Operation Mode

Based on the proposed operation mode, a large amount of real-time and multi-source operation status data of production machines (e.g. used by different customers, different machines and different operating conditions) is produced. These multi-source data are promising assets that can be used to monitor machines’ health status and discover the hidden fault features. Therefore, within the proposed mode, this article focuses on fault diagnosis of a production machine using acquired operational data and efficient data analytics, and proposes a procedural approach, which consists of four main modules (as seen in Fig. 1) namely: (1) establishing smart production machines; (2) acquisition and preprocessing multi-source data; (3) establishing deep neural network (DNN) models; and (4) DNN-based intelligent fault diagnosis. They are described in Sects. 4.1 to 4.4 in detail.

Fig. 1 Processes for fault diagnosis based on the proposed mode

4.1 Establishing Smart Production Machines

Smart production machines should be established to improve the sensing and interacting capability of all kinds of machines within the predefined location of the provider. Therefore, multiple sensing devices must be configured in the production machines before they are put into use, in order to realise the proposed operation mode and fault diagnosis approach. Following the methods for configuring smart objects at different lifecycle stages, radio frequency identification (RFID) and smart sensors [54, 55] are used to establish the smart production machines and to collect usable data sets for fault diagnosis, as exemplified in Table 2. The production machines have a certain degree of intelligence because, like humans, they can sense the environment through sensors, make decisions through the control appliance of the monitoring system, and communicate via Wi-Fi or WLAN technology.

Table 2 Sensing devices relevant to the proposed mode (examples)

4.2 Acquisition and Preprocessing Multi-Source Data

Based on the configuration of the smart production machines, an active sensing environment can be constructed. As a result, the real-time and multi-source operation status data of production machines (e.g. used by different customers, different machines and different operating conditions) can be captured timely. The operational data of machines captured by sensors is the main data for traditional model establishment. To improve the accuracy and efficiency of fault diagnosis, in this article, the data of other lifecycle stages (e.g. product design, machining process and historical maintenance records) is also taken into account, to provide comprehensive data for model training.

The data generated from the machining process can be represented as a stream of tuples of the form MP_data = {O, M, EPC, J, T}, since these data are mainly captured by RFID in the smart production environment. Here, O is the operator responsible for executing the machining task. M is the machine on which the machining task is executed. EPC (Electronic Product Code) is the unique identifier of the material machined by an operator on a machine. J is a specific job associated with a certain order. T is a timestamp that records the exact time at which the machining task takes place. A machining process data cube is established to store the tuples. Additionally, a real-time data warehouse is established to organise and manage the tuples in time sequence and to handle the complex logical relationships among the enormous number of tuples (as seen in Fig. 2).
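As a minimal illustration of the tuple structure described above, the sketch below models one MP_data tuple as an immutable Python record and orders a small stream by timestamp. The class name `MPRecord`, the helper `order_by_time` and all identifier values are hypothetical and are not taken from the case system.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MPRecord:
    """One machining process tuple {O, M, EPC, J, T} (field names are illustrative)."""
    operator: str        # O: operator who executes the machining task
    machine: str         # M: machine on which the task is executed
    epc: str             # EPC: unique identifier of the machined material
    job: str             # J: job associated with a certain order
    timestamp: datetime  # T: time at which the machining task takes place


def order_by_time(records):
    """Arrange tuples in time sequence, as a data warehouse would when organising them."""
    return sorted(records, key=lambda r: r.timestamp)


# Hypothetical stream of two RFID readings that arrive out of order.
stream = [
    MPRecord("op-02", "cnc-01", "epc-0002", "job-17", datetime(2020, 1, 1, 8, 30)),
    MPRecord("op-01", "cnc-01", "epc-0001", "job-17", datetime(2020, 1, 1, 8, 0)),
]
ordered = order_by_time(stream)
```

A frozen record is used so that a captured tuple cannot be modified after acquisition; this is a design choice of the sketch, not a requirement stated in the text.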

Fig. 2 The machining process data cube in the data warehouse

The machining process data cube contains four dimensions: tuple, information, production logic and time. Key attributes including Operator, Machine, EPC, Job and Timestamp are described in the tuple dimension; they indicate who did what, where, and at what time. In the information dimension, the tuple dimension attributes are converted into meaningful and specific sub-attributes, as shown on the top of the data cube. The production logic dimension covers the predefined production operations, including material delivery task, machining sequence, and machining code. In the time dimension, the timestamp of the machining process data cube is recorded. By combining the time dimension and the production logic dimension, the information of a machining element can be tracked.

The created machining process data cubes contain inaccurate, incomplete and redundant data, which should be reduced by a data cleaning operation. The input of the data cleaning operation is a set of machining process data cubes from the real-time data warehouse. The operation proceeds as follows: (1) define the inputs and constraint conditions of the data cleaning operation; (2) select a machining process data cube from the real-time data warehouse; (3) check whether each cell in the machining process data cube meets the predefined constraint conditions; (4) delete a cell from the machining process data cube if it does not meet the conditions; (5) repeat (2) to (4) until all machining process data cubes have been traversed; (6) output and return the cleaned machining process data cubes.
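The six cleaning steps can be sketched as a short loop. The sketch assumes that a cube is a list of cells, that each cell is a dictionary, and that constraint conditions are predicate functions; the two constraints shown (completeness and non-duplication) are illustrative examples chosen here, since the text does not fix the concrete conditions.

```python
REQUIRED = ("operator", "machine", "epc", "job", "timestamp")


def is_complete(cell):
    """Illustrative constraint: every tuple attribute must be present and non-empty."""
    return all(cell.get(key) not in (None, "") for key in REQUIRED)


def make_is_new():
    """Illustrative constraint: reject a cell whose attributes were already seen."""
    seen = set()

    def is_new(cell):
        key = tuple(cell.get(k) for k in REQUIRED)
        if key in seen:
            return False
        seen.add(key)
        return True

    return is_new


def clean_cubes(cubes, constraints):
    """Steps (1)-(6): traverse all cubes and keep only cells meeting every constraint."""
    cleaned = []
    for cube in cubes:                                   # steps (2) and (5)
        kept = [cell for cell in cube                    # step (3): check each cell
                if all(check(cell) for check in constraints)]
        cleaned.append(kept)                             # step (4): failing cells are dropped
    return cleaned                                       # step (6)


# Hypothetical cube with one valid cell, one duplicate and one incomplete cell.
good = {"operator": "op-01", "machine": "cnc-01", "epc": "epc-0001",
        "job": "job-17", "timestamp": 1}
cube = [good, dict(good), {**good, "epc": None, "timestamp": 2}]
cleaned = clean_cubes([cube], [is_complete, make_is_new()])
```

The completeness check is listed first so that incomplete cells are rejected before they are registered by the duplicate check.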

The cleaned machining process data cubes are typically still huge, and analysing such a large amount of data may be impractical or infeasible. Therefore, a data extraction operation is performed to remove the data with little meaning for fault analysis, acquire the critical machining process data cubes, and obtain a reduced representation of the cubes that is much smaller in volume while retaining the integrity of the original data. As a result, the critical machining process data are transferred to a cube that contains the key impact factors related to fault analysis. The input of the data extraction operation is the set of cleaned machining process data cubes, and the output is the set of reduced data cubes. The operation proceeds as follows: (1) select the cubes that have the same Machine ID from the cleaned machining process data cubes; (2) check whether each attribute in the selected cubes meets the logic of the tuple mentioned above; (3) produce a new sequence set if all attributes of the selected cubes meet the predefined logic of the tuple; (4) repeat (1) to (3) until all cleaned machining process data cubes have been traversed; (5) output and return the extracted data cubes.
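The extraction steps can be sketched in the same style. The sketch rests on two assumptions that go beyond the text: "meeting the logic of the tuple" is read as "all five tuple attributes are present", and the "new sequence set" is read as the time-ordered list of reduced cells per machine. The function and field names are hypothetical.

```python
from collections import defaultdict

KEY_FIELDS = ("operator", "machine", "epc", "job", "timestamp")


def extract_sequences(cells):
    """Group cleaned cells by machine ID and return one reduced, time-ordered sequence each."""
    by_machine = defaultdict(list)
    for cell in cells:                                   # step (1): select by Machine ID
        by_machine[cell.get("machine")].append(cell)

    sequences = {}
    for machine, group in by_machine.items():            # step (4): traverse all groups
        # step (2): assumed tuple logic, i.e. every key attribute is present
        if all(all(field in cell for field in KEY_FIELDS) for cell in group):
            ordered = sorted(group, key=lambda c: c["timestamp"])
            # step (3): new sequence set that keeps only the key attributes
            sequences[machine] = [{f: c[f] for f in KEY_FIELDS} for c in ordered]
    return sequences                                     # step (5)


# Hypothetical cleaned cells from two machines; "note" stands for data of little meaning.
cells = [
    {"operator": "op-01", "machine": "cnc-02", "epc": "e3", "job": "j2",
     "timestamp": 5, "note": "x"},
    {"operator": "op-01", "machine": "cnc-01", "epc": "e2", "job": "j1", "timestamp": 9},
    {"operator": "op-02", "machine": "cnc-01", "epc": "e1", "job": "j1", "timestamp": 3},
]
sequences = extract_sequences(cells)
```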

In addition, some design data related to fault diagnosis (e.g. product design specifications and maintenance/service instructions) and historical fault records often exist as low-quality free text and cannot be applied directly to the model training process. Therefore, data preprocessing should be conducted to extract keywords and topics from the free text. Text mining [4, 56] is used to remove redundant data from the free text and to extract the topics that precisely describe the data that affect fault diagnosis. For example, from the product design data, the design specifications and maintenance instructions are extracted into a new data format for fault diagnosis: machine ID, fault part ID, spare part ID and fault type. From the historical maintenance records, the timestamp, machine ID, fault part ID, fault type, maintenance engineer’s ID, repair time, etc. are extracted.
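To make the target format concrete, the sketch below turns one free-text maintenance note into a structured record. It is a deliberately simple keyword and pattern lookup, not the text mining method of [4, 56]; the ID patterns (`M-<digits>`, `P-<digits>`), the vocabulary `FAULT_TERMS` and the sample note are all hypothetical.

```python
import re

# Hypothetical vocabulary that maps keywords in free text to fault types.
FAULT_TERMS = {
    "wear": "tool wear",
    "worn": "tool wear",
    "overheating": "overheating",
    "vibration": "abnormal vibration",
}


def extract_fields(text):
    """Extract machine ID, fault part ID and fault type from one free-text record."""
    machine = re.search(r"\bM-\d+\b", text)      # assumed machine ID pattern
    part = re.search(r"\bP-\d+\b", text)         # assumed part ID pattern
    tokens = re.findall(r"[a-z]+", text.lower())
    fault = next((FAULT_TERMS[t] for t in tokens if t in FAULT_TERMS), None)
    return {
        "machine_id": machine.group() if machine else None,
        "fault_part_id": part.group() if part else None,
        "fault_type": fault,
    }


note = "Machine M-12: spindle bearing P-301 showed strong vibration, replaced."
record = extract_fields(note)
```

Fields that cannot be found are returned as `None` rather than guessed, so that incomplete records can be filtered out before model training.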

Compared with the data mentioned above, the sensor data sets are usually generated automatically and are considered to be of high quality and adequate veracity. Relevant data are extracted from these sensor data sets in the following format: timestamp, sensor ID, machine ID, part ID, measuring point, and measured value. As a result, the preprocessed machining process data, product design data, maintenance records, and sensor data collectively constitute the multi-source input data for the subsequent model training.

4.3 Establishing DNN Models Based on the Preprocessed and Multi-Source Lifecycle Data

The previous two modules provide multi-source lifecycle data that are available for modelling. The challenge within this module is how to use the preprocessed data to learn the complex associations between input data sets and output fault types, so as to mine fault features adaptively and improve the accuracy of fault diagnosis. Through deep learning, DNNs with deep architectures can be trained to achieve these objectives [57]. In general, DNN training consists of the following two stages: (1) pre-train the DNN layer by layer with unsupervised learning, such as an auto-encoder (AE) or sparse filtering; (2) fine-tune the DNN with the back propagation (BP) or Levenberg–Marquardt (LM) algorithm for classification [30].

As an unsupervised neural network, the AE is usually used for extracting data features, and it performs much better than basic single-hidden-layer networks. Moreover, the AE can represent the information of the input layer with fewer neurons. Specifically, the AE uses an encoder network to transform the input data from a high-dimensional space into codes that capture the features of the input data, and then reconstructs the inputs from the corresponding codes through a decoder network [58]. Because the input lifecycle data (i.e. product design data, maintenance records, and operation data) are multi-source and high-dimensional, the AE is adopted to pre-train the DNNs. For ease of description, the notations used in the pre-training process are defined in Table 3. During the pre-training process, these notations apply to the training samples of different lifecycle stages.

Table 3 The defined notations

Given an input training sample \({\varvec{x}}^{k}\) from the unlabelled lifecycle data sample set \(\left\{ {{\varvec{x}}^{k} } \right\}_{k = 1}^{K}\) of a production machine, \({\varvec{x}}^{k} \in R^{1 \times n}\) is a sample and \(n\) is the dimension of the input sample. Here, \({\varvec{x}}^{k}\) includes multi-source data, such as the design data, maintenance records and operation status of different production machines. The encoder network is defined as an encoding function \(f_{\alpha }\), and the AE transforms each input training sample \({\varvec{x}}^{k}\) into a hidden-layer encoded vector \({\varvec{h}}^{k}\) through the activation function \(s_{f}\) of the encoder network. Similarly, the decoder network is defined as a reconstruction function \(g_{{\alpha^{\prime}}}\). The hidden-layer encoded vector \({\varvec{h}}^{k}\) is then mapped to a reconstructed input vector \(\hat{{x}}^{k}\) through the activation function \(s_{g}\) of the decoder network.

During this process, the AE trains the parameter sets \({\varvec{\alpha}} = \left\{ {{\varvec{w}},{\varvec{b}}} \right\}\) and \({{\varvec{\alpha}}^{\prime}} = \left\{ {{\varvec{w}}^{\prime},{\varvec{b}}^{\prime}} \right\}\) so as to minimise the reconstruction error \(\phi_{AE} \left( {{\varvec{\alpha}},{{\varvec{\alpha}}^{\prime}}} \right)\) between the reconstructed outputs and the original inputs, averaged over all \(K\) training samples:

$$\phi_{AE} \left( {{\varvec{\alpha}},{{\varvec{\alpha}}^{\prime}}} \right) = \frac{1}{K}\sum\limits_{k = 1}^{K} {L\left( {{\varvec{x}}^{k} ,\hat{{x}}^{k} } \right)}$$
(1)

where \(L\left( {{\varvec{x}}^{k} ,\hat{{x}}^{k} } \right)\) is a loss function that measures the discrepancy between \({\varvec{x}}^{k}\) and \(\hat{{x}}^{k}\):

$$L\left( {{\varvec{x}}^{k} ,\hat{{x}}^{k} } \right) = \left\| {{\varvec{x}}^{k} - \hat{{x}}^{k} } \right\|^{2} { = }\left\| {{\varvec{x}}^{k} - g_{{\alpha^{\prime}}} \left( {f_{\alpha } \left( {{\varvec{x}}^{k} } \right)} \right)} \right\|^{2}$$
(2)
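Equations (1) and (2) can be traced with a small numerical sketch. It assumes that both activation functions \(s_{f}\) and \(s_{g}\) are the logistic sigmoid and that the encoder and decoder are single affine layers; the text does not fix these choices, and the weights and samples below are arbitrary illustrative values.

```python
import math
import random


def sigmoid(z):
    """Logistic activation, assumed here for both s_f and s_g."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W v + b for a row-major weight matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def encode(x, W, b):
    """Encoder f_alpha: map an input sample x to the hidden code h."""
    return [sigmoid(z) for z in affine(W, b, x)]


def decode(h, W2, b2):
    """Decoder g_alpha': map the hidden code h back to a reconstruction x_hat."""
    return [sigmoid(z) for z in affine(W2, b2, h)]


def sample_loss(x, x_hat):
    """Eq. (2): squared Euclidean distance between x and its reconstruction."""
    return sum((a - r) ** 2 for a, r in zip(x, x_hat))


def reconstruction_error(X, W, b, W2, b2):
    """Eq. (1): mean of the per-sample losses over the K training samples."""
    return sum(sample_loss(x, decode(encode(x, W, b), W2, b2)) for x in X) / len(X)


rng = random.Random(0)
n, n_hidden = 4, 2                                   # input and code dimensions
W = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n_hidden)]
b = [0.0] * n_hidden
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n)]
b2 = [0.0] * n
X = [[0.2, 0.8, 0.5, 0.1], [0.9, 0.3, 0.4, 0.7]]     # K = 2 illustrative samples
phi_ae = reconstruction_error(X, W, b, W2, b2)
```

With all weights and biases set to zero, every reconstructed component equals 0.5, so the sample (1, 0, 1, 0) gives a loss of 4 × 0.25 = 1, which is a convenient hand check of Eq. (2).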

Based on the DNN pre-training architecture, \(M\) AEs are trained to pre-train an \(M\)-hidden-layer DNN, as depicted in Fig. 3. This pre-training process has been shown to help achieve better generalisation in classification and fault diagnosis tasks. The details of the DNN pre-training process are illustrated in Fig. 4.

Fig. 3 Pre-training process of DNN (HL is the abbreviation of the hidden layer)

Fig. 4 The flows of DNN pre-training
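The layer-wise pre-training can be sketched as a loop in which each AE is trained on the codes produced by the previous one. The sketch assumes sigmoid activations, per-sample gradient descent on the squared reconstruction error, and arbitrary hyperparameters (`epochs`, `lr`, `seed`); none of these is specified in the text, and the data are illustrative.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_autoencoder(X, n_hidden, epochs=200, lr=0.5, seed=0):
    """Train one AE by gradient descent; return the encoder parameters (W, b)."""
    rng = random.Random(seed)
    n_in = len(X[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b = [0.0] * n_hidden
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    b2 = [0.0] * n_in
    for _ in range(epochs):
        for x in X:
            h = [sigmoid(sum(W[j][i] * x[i] for i in range(n_in)) + b[j])
                 for j in range(n_hidden)]
            x_hat = [sigmoid(sum(W2[i][j] * h[j] for j in range(n_hidden)) + b2[i])
                     for i in range(n_in)]
            # Gradients of the squared reconstruction error w.r.t. pre-activations.
            d_out = [2 * (x_hat[i] - x[i]) * x_hat[i] * (1 - x_hat[i])
                     for i in range(n_in)]
            d_hid = [sum(d_out[i] * W2[i][j] for i in range(n_in)) * h[j] * (1 - h[j])
                     for j in range(n_hidden)]
            for i in range(n_in):                      # decoder update
                for j in range(n_hidden):
                    W2[i][j] -= lr * d_out[i] * h[j]
                b2[i] -= lr * d_out[i]
            for j in range(n_hidden):                  # encoder update
                for i in range(n_in):
                    W[j][i] -= lr * d_hid[j] * x[i]
                b[j] -= lr * d_hid[j]
    return W, b


def encode(X, W, b):
    """Apply a trained encoder to every sample."""
    return [[sigmoid(sum(w * v for w, v in zip(row, x)) + b_j)
             for row, b_j in zip(W, b)] for x in X]


def pretrain_stack(X, hidden_sizes):
    """Train M AEs in turn; the codes of one AE are the inputs of the next."""
    layers, inputs = [], X
    for n_hidden in hidden_sizes:
        W, b = train_autoencoder(inputs, n_hidden)
        layers.append((W, b))          # initial parameters of one hidden layer
        inputs = encode(inputs, W, b)
    return layers, inputs


X = [
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
]
layers, codes = pretrain_stack(X, hidden_sizes=[4, 2])   # M = 2 hidden layers
```

Only the encoder parameters of each AE are kept, because they initialise the corresponding hidden layer of the DNN; the decoders are discarded after pre-training.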

In order to monitor the health status of the production machine, an output layer that contains the output targets for classification is added to the DNN. The output \({\varvec{y}}^{k}\) of the DNN for the input data \({\varvec{x}}^{k}\) is expressed as:

$${\varvec{y}}^{k} = f_{{\alpha_{M + 1} }} \left( {{\varvec{h}}_{M}^{k} } \right)$$
(3)

where \({\varvec{\alpha}}_{M + 1}\) is the parameter set of the output layer.

Suppose that the output target of \({\varvec{x}}^{k}\) is \({\varvec{d}}^{k}\). To approximate the output target, the BP algorithm is utilised to calculate the parameters in the DNN backwards. The error criterion is calculated by:

$$\phi_{DNN} \left( {\rm A} \right) = \frac{1}{K}\sum\limits_{k = 1}^{K} {L\left( {{\varvec{y}}^{k} ,{\varvec{d}}^{k} } \right)}$$
(4)

where \({\rm A} = \left\{ {{\varvec{\alpha}}_{1} ,{\varvec{\alpha}}_{2} ,...,{\varvec{\alpha}}_{M + 1} } \right\}\). Minimising \(\phi_{DNN} \left( {\rm A} \right)\) fine-tunes the DNN and completes its training. In each fine-tuning iteration, the parameter set \({\rm A}\) is updated with the learning rate \(\kappa\), which controls the step size and is chosen to support convergence of the update process:

$${\rm A} = {\rm A} - \kappa \frac{{\partial \phi_{DNN} \left( {\rm A} \right)}}{{\partial {\rm A}}}$$
(5)
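
The update rule of Eq. (5) can be illustrated with a deliberately small example. The sketch below applies it to a one-parameter model \(y = a x\) with a squared-error criterion analogous to Eq. (4); the toy data, the value of \(\kappa\) and the function names are assumptions chosen only to make the behaviour visible, and the 300 iterations merely mirror the fine-tuning limit used later in the case study.

```python
def mean_squared_error(a, xs, ds):
    """Error criterion for the toy model y = a * x (analogue of Eq. (4))."""
    return sum((a * x - d) ** 2 for x, d in zip(xs, ds)) / len(xs)


def gradient(a, xs, ds):
    """Analytical derivative of the error criterion with respect to a."""
    return sum(2.0 * (a * x - d) * x for x, d in zip(xs, ds)) / len(xs)


def fine_tune(a, xs, ds, kappa, iterations):
    """Repeat the update of Eq. (5): a <- a - kappa * d(phi)/d(a)."""
    history = [mean_squared_error(a, xs, ds)]
    for _ in range(iterations):
        a = a - kappa * gradient(a, xs, ds)
        history.append(mean_squared_error(a, xs, ds))
    return a, history


# Made-up inputs and targets; the targets were generated with a = 2.
xs = [1.0, 2.0, 3.0]
ds = [2.0, 4.0, 6.0]

a_final, history = fine_tune(a=0.0, xs=xs, ds=ds, kappa=0.05, iterations=300)
print("fitted parameter:", a_final)
print("initial and final error:", history[0], history[-1])
```

With a much larger \(\kappa\) the same loop would overshoot and diverge, which is why the learning rate has to be chosen with convergence in mind.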

4.4 DNN-based Intelligent Fault Diagnosis Method

The data and model explained in Sects. 4.2 and 4.3 are used to discover fault features. Each fault feature is represented by a parameter with a value (or a set of parameters with values) that characterises a symptom of a fault or its causes. For instance, the parameter "processing depth of a tool" reflects the symptom "tool wear". A large amount of data related to the auxiliary parts and the operation parameters is used to learn complex correlations between the input multi-source data and fault symptoms in order to identify fault features. The procedure of the DNN-based fault diagnosis is shown in Fig. 5. This is expected to produce higher diagnosis accuracy owing to the multi-source big data of the production machine, as stated at the beginning of Sect. 4.

Fig. 5 Procedure of the DNN-based fault diagnosis

Step 1 collects and standardises multi-source data to be used as training samples. The standardised lifecycle data comprise the training set \(\left\{ {x_{DD}^{i} ,d_{DD}^{i} } \right\}_{i = 1}^{K}\), where \(x_{DD}^{i}\) is the \(i\)th lifecycle data sample of the production machine for training, \(d_{DD}^{i}\) is the health status label of \(x_{DD}^{i}\), and \(K\) is the number of lifecycle data samples.

Step 2 trains the DNN model by using the standardised lifecycle data. Unsupervised learning is used to pre-train \(M\) AEs layer by layer to establish a DNN with \(M\) hidden layers. The number of input neurons is the dimension of the samples in the unlabelled lifecycle training set \(\left\{ {x_{DD}^{i} } \right\}_{i = 1}^{K}\). Then, the \(i\)th encoded vector \(h_{i}^{k}\) is used as the input to train the \((i + 1)\)th AE, which initialises the parameters of the \((i + 1)\)th hidden layer of the DNN and yields \(h_{i + 1}^{k}\). The training process is shown in Fig. 4. This process is executed in turn until the \(M\)th AE is trained to initialise the final hidden layer of the DNN. Meanwhile, the dimension of the output layer is determined according to the number of health status labels in the lifecycle data of the production machine. The BP algorithm is used to fine-tune the parameters of the DNN and to minimise the error between the output and the labelled health status samples. Here, the design specification data of key components and auxiliary parts are used as the labelled samples.

Step 3 utilises the trained DNN to output fault diagnosis results and to predict faults.
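
The layer-by-layer pre-training in Step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the Matlab implementation used in the case study: each AE has a sigmoid encoder and a linear decoder, is trained by plain stochastic gradient descent on half the squared error of Eq. (2), and the toy samples, layer sizes, learning rate and function names are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(x, w, b):
    """Hidden code of one sample: sigmoid(w x + b)."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + bj)
            for row, bj in zip(w, b)]


def train_autoencoder(data, n_hidden, epochs=100, rate=0.1, seed=0):
    """Train one AE by SGD and return its encoder parameters (w, b)."""
    rng = random.Random(seed)
    n_in = len(data[0])
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    b = [0.0] * n_hidden
    w_dec = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
    b_dec = [0.0] * n_in
    for _ in range(epochs):
        for x in data:
            h = encode(x, w, b)
            x_hat = [sum(wd * hj for wd, hj in zip(row, h)) + bd
                     for row, bd in zip(w_dec, b_dec)]
            err = [xh - xi for xh, xi in zip(x_hat, x)]
            # Propagate the error to the hidden code before w_dec is changed.
            delta_h = [sum(err[i] * w_dec[i][j] for i in range(n_in))
                       * h[j] * (1.0 - h[j]) for j in range(n_hidden)]
            for i in range(n_in):
                for j in range(n_hidden):
                    w_dec[i][j] -= rate * err[i] * h[j]
                b_dec[i] -= rate * err[i]
            for j in range(n_hidden):
                for i in range(n_in):
                    w[j][i] -= rate * delta_h[j] * x[i]
                b[j] -= rate * delta_h[j]
    return w, b


def pretrain(data, hidden_sizes):
    """Greedy layer-wise pre-training: each AE is trained on the previous codes."""
    layers, codes = [], data
    for n_hidden in hidden_sizes:
        w, b = train_autoencoder(codes, n_hidden)
        layers.append((w, b))
        codes = [encode(c, w, b) for c in codes]
    return layers, codes


# Made-up standardised samples with 4 features each; not the case-study data.
toy_data = [[0.1, 0.9, 0.2, 0.8], [0.9, 0.1, 0.8, 0.2], [0.2, 0.8, 0.1, 0.9]]
layers, top_codes = pretrain(toy_data, hidden_sizes=[3, 2])
print("hidden layers initialised:", len(layers))
```

The returned `layers` would then initialise the hidden layers of the DNN, after which an output layer is added and the whole network is fine-tuned with the BP algorithm.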

Model updating can be carried out in two ways. On the one hand, the real-time updated operation status data of production machines can serve as new training and test samples to train and update the model. On the other hand, transfer learning can be used to modify and update the trained model so as to make it suitable for different fault diagnosis problems. With the model updating module, the time for model training and fault diagnosis can be shortened. Moreover, model updating helps to discover more hidden fault features and promotes the implementation of dynamic prediction of product faults.

5 Application to a Real Case at a CNC Machine Manufacturer

5.1 Overview of the Target Case Company

The case company is a leading CNC machine manufacturer in China. Over 100 key customers use the company’s CNC machines to process nearly 200 kinds of precision and high-end products, e.g., precision molds and specular machining. It is of utmost importance for the company to prevent faults and to ensure processing quality. The company was seeking a new way to realise the potential of sensing and using operation environment data, as well as lifecycle big data, for fault diagnosis, and therefore tested the new operation mode described in Sect. 3.

In this case study, some data and perspectives were collected and summarised from semi-structured interviews with the general manager of the case company (also a co-author of this paper). The general manager also participated in the whole process of the case study through action research. The interviews and action research enhanced the validity of the case’s constructs and provided strong support for verifying the effectiveness and feasibility of the method proposed in this paper.

5.2 Configuration of the Smart Production Machine and Data Sources

For ease of understanding and without loss of generality, one type of CNC machine is selected to illustrate how the case company makes its production machines smart and acquires multi-source data. Multiple types of smart sensors are applied to configure a smart CNC machine and capture its multi-source operation data. The deployment information is shown in Table 4. In addition, this type of CNC machine is leased and shared at a predefined location of the company by multiple customers from, for example, the 3C small hardware industry, the precision mold and die and electrodes industry, and the hard-cutting materials industry.

Table 4 The deployment information of the smart sensors for a specific type of CNC machine

The proposed mode described in Table 1 was implemented. Two types of data were used in the fault diagnosis: (1) design data, i.e., geometric tolerances and specifications; and (2) NC system and sensor data obtained over 12 months, e.g., current of the spindle motor, geometric errors of WIP from probes, compressed air humidity from humidity sensors, and temperature and vibration in a room from thermometers and vibration sensors.

Figure 6 shows an example of real-time measurement of the geometric errors on the company’s CNC machines. In the machining process, the photoelectric touch probe is used first to measure the coordinate values of a series of points on a WIP, and second to calculate the processing depth of tools and the geometric errors of the WIP in real time. For example, in a batch of 2 million pieces with five measuring points per WIP, 10 million geometric error data sets were produced.

Fig. 6 Real-time measuring of the geometric errors of WIP

An excerpt of the real-time measured geometric error data sets for this batch of processing products is shown in Table 5. The specified processing depth of tool T01 is 0.15 ± 0.02 mm. For WIPs 00020 to 00029, the measured value of processing depth was within the tolerance (± 0.02 mm), while for WIPs 00235 to 00244, the measured value exceeded the tolerance.

Table 5 Excerpt of the real-time measured geometric error data sets
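
The tolerance check behind this excerpt can be sketched in a few lines of Python. The nominal depth (0.15 mm) and tolerance (± 0.02 mm) come from the text; the measured values in the example are made up for illustration and are not rows of Table 5, and the function names are hypothetical.

```python
NOMINAL_DEPTH_MM = 0.15   # specified processing depth of tool T01
TOLERANCE_MM = 0.02       # allowed deviation on either side


def deviation_mm(measured_mm):
    """Signed deviation from the nominal depth, rounded to suppress float noise."""
    return round(measured_mm - NOMINAL_DEPTH_MM, 6)


def within_tolerance(measured_mm):
    """True if the measured depth lies inside 0.15 +/- 0.02 mm (bounds included)."""
    return abs(deviation_mm(measured_mm)) <= TOLERANCE_MM


# Made-up measurements, not taken from Table 5.
for depth in (0.16, 0.17, 0.12):
    status = "within" if within_tolerance(depth) else "exceeds"
    print(depth, deviation_mm(depth), status)
```

Rounding the deviation is a design choice that keeps boundary values such as 0.17 mm from being misclassified because of binary floating-point error.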

In this case study, a four-layer DNN with one input layer, two hidden layers and one output layer was designed to identify the fault features of the production machine. A total of 1500 data sets were extracted to train and test the fault diagnosis model. Based on empirical settings, the structure of the DNN model is designed as [1500, 300, 64, 1], which means that the network contains one input layer (1500 neurons), two hidden layers (300 and 64 neurons, respectively) and one output layer (1 neuron). To test the generalisation of the model, the data sets were randomly divided into training data (80%) and test data (20%). In the pre-training process, two AEs are used to initialise the weights and thresholds of the hidden layers, and the maximum iteration number of each AE is set to 100. In the fine-tuning process, the maximum iteration number of the whole model is set to 300. With these settings, the DNN-based fault diagnosis model of the production machine can be established.
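
The stated structure and data split can be checked with a short Python sketch. It assumes fully connected layers with one threshold (bias) per neuron, which the text does not state explicitly; the helper names and the random seed are hypothetical.

```python
import random

LAYER_SIZES = [1500, 300, 64, 1]   # structure given in the text
N_DATA_SETS = 1500                 # data sets extracted for training and testing


def count_parameters(layer_sizes):
    """Weights plus thresholds, assuming fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def split_indices(n_samples, train_fraction=0.8, seed=0):
    """Randomly split sample indices into training and test subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(n_samples * train_fraction))
    return indices[:n_train], indices[n_train:]


n_params = count_parameters(LAYER_SIZES)
train_idx, test_idx = split_indices(N_DATA_SETS)

print("trainable parameters:", n_params)
print("training / test samples:", len(train_idx), len(test_idx))
```

Under these assumptions the network has 469,629 trainable parameters, and the 80/20 split yields 1200 training and 300 test data sets.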

5.3 Results of Applying the Operation Mode and Procedure

The data mentioned in Sect. 5.2 were applied to the procedure of Sect. 4.4. In this case study, the computational time of the DNN model mainly consists of three parts: (1) the parameter mapping time from the input layer to the first hidden layer based on AE1 (56.66 s); (2) the parameter mapping time from the first hidden layer to the second hidden layer based on AE2 (4.28 s); and (3) the fine-tuning time for the whole DNN model (11.59 s). Therefore, the total time for establishing the DNN model is 72.53 s. The loss function value and the accuracy on the training data are 0.3874 and 0.9205, respectively; those on the test data are 0.4375 and 0.8968, respectively. The experiments were performed on a workstation (Intel(R) Core(TM) i7-7700K CPU @ 4.20 GHz) with 32 GB of RAM and a 64-bit Windows 10 Enterprise operating system, and Matlab 2017a was used to train and test the DNN model. The results are described for two fault features, spindle position accuracy decrease and spindle shaft break, as examples. Strong correlations were found between tool wear and the real-time measured WIPs’ geometric errors, the tools’ processing depth, and the spindle motor current and load. Details of the correlations are as follows.

Firstly, when the measured over-tolerance is continuously lower than -0.01 mm for either the WIP’s geometric error or the T01 tool’s processing depth, the tool is likely to be worn out. This can be understood as a cause-and-effect relationship between worn-out tools and the two kinds of over-tolerance phenomena mentioned above.
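
A rule of this kind can be sketched as a simple run-length check on the stream of measured deviations. The -0.01 mm limit comes from the text; the number of consecutive measurements that counts as "continuously" is not given, so `MIN_CONSECUTIVE` below is a hypothetical value, as are the function name and the example readings.

```python
WEAR_LIMIT_MM = -0.01   # limit stated in the text
MIN_CONSECUTIVE = 5     # hypothetical run length; not specified in the paper


def first_wear_alarm(deviations_mm, limit=WEAR_LIMIT_MM, run_length=MIN_CONSECUTIVE):
    """Return the index at which the deviation has stayed below the limit
    for run_length consecutive measurements, or None if it never does."""
    run = 0
    for index, value in enumerate(deviations_mm):
        run = run + 1 if value < limit else 0
        if run >= run_length:
            return index
    return None


# Made-up deviations (mm) for successive WIPs.
readings = [0.0, -0.012, -0.005, -0.011, -0.013, -0.012, -0.015, -0.02]
print("alarm index:", first_wear_alarm(readings))
```

Requiring a run of measurements rather than a single reading keeps an isolated outlier from triggering a tool change.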

Secondly, as the tool wear increases, the spindle position accuracy decreases. When the spindle load torque exceeds 45% of its full load torque for a long time after the tool has worn by over 50% of its original width, the probability of a spindle shaft break is significantly higher. This can be considered a correlation between tool wear and spindle shaft break. When the tool wear width reaches a certain threshold (e.g. 50% of the original width), a spindle shaft break is likely to occur. By analysing the load current of the spindle motor, which reflects the spindle load torque, it was found that, when the tool wear width reaches 45% to 50% of its original width, the spindle load torque reaches 40% to 60% of its full load torque. This growth appears to be near-linear. As the wear increases further, the tools gradually drift away from the surface of the WIPs. Consequently, the spindle load torque drops to 0% while processing a WIP. This signifies a high risk of the shaft breaking, and the tool must be changed immediately. In the experimental batch of 2 million pieces, spindle shaft breaks due to worn-out tools occurred five times.
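
The thresholds reported in this paragraph can be collected into a small decision rule. The sketch below is only an illustration of how such findings might be encoded: the 50% wear and 45% load thresholds and the zero-load condition come from the text, while the function name, the returned messages and the omission of the "for a long time" duration condition are simplifying assumptions.

```python
WEAR_THRESHOLD = 0.50   # worn fraction of the original tool width (from the text)
LOAD_THRESHOLD = 0.45   # fraction of full load torque (from the text)


def spindle_risk(wear_fraction, load_fraction, processing):
    """Classify the spindle state from tool wear and spindle load torque.

    The duration condition mentioned in the text is deliberately left out.
    """
    if processing and load_fraction <= 0.0:
        # The tool has drifted away from the WIP surface.
        return "change tool immediately"
    if wear_fraction > WEAR_THRESHOLD and load_fraction > LOAD_THRESHOLD:
        return "elevated risk of spindle shaft break"
    return "normal"


# Made-up operating points.
print(spindle_risk(0.20, 0.30, processing=True))
print(spindle_risk(0.55, 0.50, processing=True))
print(spindle_risk(0.60, 0.00, processing=True))
```

In the proposed procedure such relationships are learned by the DNN from data; an explicit rule like this is mainly useful for explaining or cross-checking the learned behaviour.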

Thirdly, as the compressed air humidity increases, spindle faults are more likely to occur. This is plausible, as higher humidity causes more internal rust in a spindle. Together with humidity controllers, the procedure contributed to decreasing the frequency of spindle faults from 120 times to fewer than three times per year per 1000 CNC machines.

The procedure was applied in real time, and feedback control was realised to predict and prevent faults in advance. Based on historical product faults, a large-scale knowledge base can be established by the PSS provider. By designing and creating a mechanism for knowledge learning, sharing and self-optimisation among all product users, a high-quality fault diagnosis model can be trained using a small amount of newly labelled data. Therefore, migration among similar knowledge can be realised in a timely manner. The shared and migrated knowledge can provide valuable assets for the PSS provider to carry out feedback control, fault diagnosis and fault prediction. This is difficult to carry out under the traditional PSS mode, because the products are usually distributed across different regions and factories.

Table 6 compares the efficiency and effectiveness of the fault diagnosis with and without the procedure. It can be seen from Table 6 that the fault prediction time can be greatly reduced by using the method proposed in this paper. Based on the semi-structured interviews with the general manager of the case company, the motivations for and advantages of reducing the prediction time are analysed and summarised as follows:

Table 6 The time (seconds) for fault prediction with and without the procedure

1) By reducing the fault prediction time, the temporary shutdown caused by accidental failure can be avoided to ensure smooth production and improve production efficiency.

2) The product faults can be located and eliminated in advance. Therefore, the risk of damage to other parts and the whole system caused by the failure of one part can be avoided.

3) The high maintenance cost caused by breakdown maintenance can be avoided. Meanwhile, the threat to the personal safety of operators can be reduced.

6 Discussion

The proposed operation mode and procedural approach were found effective, as shown in Sect. 5.3. Compared with other works, e.g. [30], the main difference lies first in the coverage and the source of the collected data. The mode enables access to operation status data of non-key components (e.g. auxiliary parts), design data, WIPs’ geometric error data, operation parameters (e.g. compressed air humidity) and operating environment data. The procedural approach with the collected data allows the PSS providers to find more fault features and to find conventional fault features within a shorter time. Therefore, the DNN-based machine diagnosis can improve the PSS performance by minimising the maintenance cost [59] and making machine tools more durable [60]. The other main difference is adaptability: the data and procedure need no prior definition of a specific problem to be found, whereas approaches that rely on such a definition tend to focus on data from the key components. For instance, an improper setting of the operating environment that leads to a fault of a key component, but is usually overlooked, can be detected to prevent the fault. The accessible product data, the procedural approach and the advanced operation mode together are the keys to exploiting the potential of the big data. It should be emphasised that the provider’s unique knowledge about possible causes of the faults is effectively utilised in the proposed mode, in particular regarding where to install appropriate sensors. Lastly, what cannot be measured cannot be addressed: e.g. under micro-cutting, the change of spindle load torque is too small to be measured, and therefore fault diagnosis of tool wear and spindle shaft break is not applicable in these conditions.

According to the semi-structured interviews with the general manager, in the past, even though all machining parameters of the machines were set in a suitable range, the final product sometimes failed to reach the required precision. By applying the new mode and procedure with big data, more fault features were found and, thereby, the fault diagnosis performance was enhanced. In addition, through the advanced operation mode and DNN-based machine diagnosis, the case company performed active preventive maintenance to eliminate faults earlier and keep production processes running smoothly. This has moved the company from planned corrective maintenance to proactive and smart maintenance planning, and has reduced its maintenance costs and its customers’ use costs while substantially reducing the wastage of material and production time. The general manager also stated that they could share machining technology, technical solutions, and even lean management patterns in real time. These have brought more profits and created more value-added services for the company. Although a substantial amount of cost and effort is needed to invest in the new operation mode, the benefits outweigh the investment. It should be noted that collecting the big data provides the manufacturer with additional benefits such as new ideas for R&D. Further, other potential benefits of this mode exist: e.g., the usage rate of CNC machines can be increased because of the shared use, and thereby resource efficiency is expected to increase.

The disadvantages of the proposed mode include the customer’s need to transport the inputs to and the outputs from the machine between the machining site and the customer’s own site. Another downside is the risk of a customer’s sensitive information (e.g. what is produced by the machine) being revealed to the provider. These limit the adoption of the proposed mode. However, most of the customers have long-term cooperation with the case company and trust it highly. Therefore, this issue was minor compared to the benefits they can gain, such as the provider’s unique knowledge utilised within the mode. Meanwhile, the emerging Blockchain technology [61, 62] could establish a trust mechanism between the customers and the PSS providers and help to address the data security problem. In addition, the authors found that networks suited to temporal data, such as the 3D Convolutional Neural Network [63, 64] or the recurrent Long Short-Term Memory network [65, 66], can be used to extract temporal features from historical data and to model the temporal relationships between individual data points. Therefore, recurrent structures will be considered in future work so that the hidden layers of the DNN model can pass information across time steps. This is very important for improving the accuracy of fault diagnosis and fault prediction and for better implementing the preventive maintenance strategy.

7 Conclusions and Future Works

The PSS has become a pervasive business strategy among manufacturers, enabling them to improve their sustainable competitive advantage and to avoid potential product defects. However, with the wide application of sensors and wireless technologies, the PSS provider faces many challenges. For example, within existing PSS, there is a lack of an appropriate operation mode that combines available data collection approaches and efficient data analytics to show and assess the real potential and value of the operation and maintenance big data of different products, different customers and different conditions.

To address the challenges mentioned above, this article proposed a new PSS-based operation mode and a procedural approach for fault diagnosis using deep learning and lifecycle big data to enhance maintenance performance. They were validated with a real case at a CNC machine provider. This can be seen as one way of optimising a production system to improve planning and management. The main contribution lies in advancing scientific and practical knowledge of how BDA can create value through an appropriate operation mode. The operation mode here involves organisational issues such as ownership, location, and machine operators.

Future works will focus on the following three aspects. The first is to explore approaches for the dynamic scheduling and allocation of production resources (including personnel, equipment, technology, service, etc.) within the proposed operation mode, so as to maximise the utilisation rate of these resources (as stated in Sect. 3). The second is to investigate the mode’s effect on resource efficiency, as stated in Sect. 6, considering environmental sustainability. In the case study, the model was trained on only one work element. Therefore, the third is to verify the generalisation of the proposed model on different datasets or different work elements.