Abstract

Industry 4.0 focuses on continuous interconnection services, allowing for the uninterrupted exchange of signals or information between related parties. Messaging protocols that transfer data to remote locations must meet specific requirements such as asynchronous communication, compact messaging, operation over unstable connections and limited network bandwidth, support for multilevel Quality of Service (QoS), and easy integration of new devices. The Message Queue Telemetry Transport (MQTT) protocol is used in software applications that require asynchronous communication. It is a lightweight and simple protocol based on publish-subscribe messaging that runs on top of TCP/IP. It is designed to minimize the required communication bandwidth and system resources while increasing the reliability and probability of successful message transmission, making it ideal for Machine-to-Machine (M2M) communication or for networks where bandwidth is limited, delays are long, coverage is not reliable, and energy consumption should be as low as possible. Despite these advantages, the way MQTT operates does not provide a serious level of security for the interconnection, as the protocol does not rely on an intermediate third entity and the interface depends on each application. This paper presents an innovative real-time anomaly detection system to detect MQTT-based attacks in cyber-physical systems. It is an online-semisupervised learning neural system based on a small number of sampled patterns that identify crowd anomalies in the MQTT protocol related to specialized attacks that undermine cyber-physical systems.

1. Introduction

From a conceptual approach, Industry 4.0 [1] can be seen as a new organizational level of automated value chain management methods, encompassing the full life cycle of the industrial process, from raw materials to the final product. Analyzing this model, its main and key feature was identified in the integration of industrial processes with the wide integration of various information and communication systems, methods, resources, and information flows, through industrial networks and broad communication of cyber-physical systems [2].

Cyber-physical systems are a supergrid of collaborative computing and communication components that monitor, coordinate, and control physical entities through feedback loops, where processes occurring in the physical domain influence the computations that are performed and the other way around [3, 4]. The dynamics of physical processes are multiplied by the dynamism of software and networking in these systems, providing abstract models of technical analysis and design for a unified whole, more related to the intersection rather than the union of the physical world with the digital world [5, 6].

Essentially, cyber-physical systems are a new generation of advanced systems that achieve, through information technology, communications, precision control, coordination, and autonomy, the union of the physical world with the digital world [7, 8]. These systems provide extensive M2M communication with easy-to-use and simple protocols for the integration of processes between interconnected sensors to carry out bilateral controls, assisting in a decentralized decision-making process [9].

The MQTT protocol [10] with its simplicity of installation and use and the low need for system resources has managed to dominate and eventually become the main and widespread messaging protocol for communication between cyber-physical systems, as well as in embedded systems with IoT/IIoT capabilities [7, 10]. The 3 basic components of the protocol are the MQTT Broker, which is also the server of the messaging system that receives and manages the information, the MQTT Publisher, which is also the sender of the information on the server, and the MQTT Subscriber, which is the recipient who connects to the server and receives the information. A typical example of a general approach to this communication based on the MQTT protocol architecture is shown in Figure 1.

Although this communication protocol is designed to support Transport Layer Security (TLS) and Secure Sockets Layer (SSL), by default it transmits information in plain text and allows anonymous users to connect and publish or subscribe to messages [11]. Another important security gap is that its “open” SYS-topics, which are used for specialized processes such as monitoring and configuration, allow anyone to send fake data to clients or reprogram devices.

For the Industry 4.0 business environment to achieve its goals, it is particularly important and timely to create the processes and resolve issues related to secure M2M communication to ensure the operational continuity and productivity of the systems operating in the specific environment. Production facilities, and industrial systems in general, require a different type of security than corporate networks, as traditional security solutions, such as antimalware and firewalls, do not fulfill industry norms and requirements [12]. Accordingly, the protection of industrial confidentiality requires robust safeguard policies, as this information is the target of industrial espionage by well-organized highly specialized cybercrime groups. Under this consideration, it is a fact that cybersecurity is not the primary key issue of the architectural design of industrial infrastructures. Also, it is not economically practicable to fully upgrade them, while it is almost impossible to isolate them partially or completely from the network they operate. In conclusion, the protection of industrial infrastructure from cybersecurity incidents is critical, as any kind or size of failure can create dynamic interdependencies and incalculable economic consequences.

In this sense and recognizing not only the necessity of use but also the vulnerabilities that characterize the communication based on the MQTT protocol, this paper presents an online-semisupervised learning neural anomaly detection system to detect MQTT-based attacks, without special requirements and resources [13].

The rest of the paper is organized as follows: Section 2 highlights some of the main related works; Section 3 analyzes the proposed system in detail, including its mathematical formulation; Section 4 describes the data used and the experimental scenarios for the implementation of the proposed system; and, finally, Section 5 summarizes the main research contribution, the novelty of the approach, and the future studies that can extend the proposed methodology.

2. Literature Review

The tremendous increase in data exchange across various IoT sensors and communication protocols has heightened security worries, highlighting the importance of robust methods to identify threats quickly and accurately [6, 14]. Security professionals and researchers rely more and more on automated methods with the help of deep learning to enhance the effectiveness of anomaly detection [15]. Deep learning is a type of artificial intelligence that models the learning process using many neurons and it is becoming increasingly popular in business [16].

For example, Ullah and Mahmoud [17] designed and developed an anomaly-based intrusion detection model intended to be used for IoT networks by using a convolutional neural network (CNN) model to build a multiclass classification model. This particular scheme is then realized in 1D, 2D, and 3D using CNN. The BoT-IoT, IoT Network Intrusion, MQTT-IoT-IDS2020, and IoT-23 intrusion detection datasets are utilized to evaluate the CNN implementation. They further utilized the CNN multiclass pretrained scheme to implement binary and multiclass classification via transfer learning. Also, Haripriya et al. [11] proposed Secure-MQTT, a lightweight fuzzy logic-based IDS for identifying nefarious activities during IoT device exchange of data. With the use of a fuzzy rule interpolation mechanism, the proposed solution uses a fuzzy logic-based system to detect the node’s nefarious activity. Secure-MQTT eliminates the necessity of a dense rule base by utilizing fuzzy rule interpolation, which dynamically produces rules. This suggested technique includes a system for preventing Denial-of-Service attacks on low-configuration devices. Vaccari et al. [18] suggested MQTTset, a dataset centered on the MQTT protocol, which is frequently used in IoT networks. By simultaneously validating the legal dataset with cyber attacks on the MQTT network, they demonstrated the establishment of the dataset in addition to validation by the creation of a fictitious detection system. Results showed how this system can be utilized to train similar learning models for detecting systems that can secure IoT environments.

On the other hand, Hasan et al. [15] presented a machine learning-based method for detecting and protecting a system in an abnormal state. Several machine learning classifiers were used for this task. Another finding of this study is that, for anomaly detection, a simple model such as a Decision Tree or Random Forest can be comparable to a more complex network such as an ANN. Ciklabakkal et al. [19] proposed ARTEMIS, an IoT IDS that analyzes data from IoT devices using artificial intelligence to notice deviations from the system’s usual behavior and sends notifications in the event of abnormalities. They built a model of the system using IoT devices subscribed to topics at an MQTT broker and tested it against MQTT-related threats. In addition, Zhang et al. [20] built the FedIoT platform, which includes an N-BaIoT synthesized dataset, the FedDetect algorithm, and a system layout for IoT devices. The FedDetect learning system uses an adaptive optimizer and a cross-round learning rate scheduler to boost performance. They tested the FedIoT platform and the FedDetect algorithm in a network of IoT devices, such as Raspberry Pi, in terms of model and system performance. The findings showed that federated learning is effective in catching a wide variety of cyber attacks. The system efficiency analysis shows that both the end-to-end training time and the memory cost are low and promising for the limited resources of IoT devices.

Unlike the related works presented above, this research presents a real-time anomaly detection technique for detecting MQTT-based attacks on cyber-physical systems, based on a small number of sampled patterns that identify crowd anomalies in the MQTT protocol [10].

3. The Proposed System

As already mentioned, IoT/IIoT, and in general the M2M communication of cyber-physical systems, relies on wireless technologies, which are used to provide data access to end devices [21, 22]. For the implementation of these services devices of limited resources are used, with very low energy consumption and, respectively, low power, while due to their distributed architecture, they present serious weaknesses in their flexible management and consequently in their application to modern requirements, such as interoperability, mobility, heterogeneity, and quality of services [23]. Therefore, the application of advanced digital security techniques should follow algorithm design and implementation technologies, considering features such as the traffic of network nodes, the speed, and quality of communication between them, as well as the minimum available computing resources [24, 25].

Complying with the above important conditions, this paper proposes a small and flexible neural network architecture, which can handle even big data processing. Specifically, a Single-hidden Feedforward Neural Network (S-hFFNN) [26] is implemented with N random hidden neurons, random input-layer weights W, and random biases b. The output weights β are assigned by the Generalized Least Squares Approximation (GLSA) [27] technique, so that they are calculated by a single linear algebra operation, in particular a single matrix multiplication, without requiring iterative learning procedures.

The random weights generate approximately orthogonal and weakly correlated features at the hidden layer, which offers an accurate solution and high generalization ability. More specifically, the output of the proposed S-hFFNN is a weighted sum of the activations of the N random hidden neurons, with the output weights β as coefficients [26–28].

From this point of view, the method solves the learning problem $H\beta = T$, where $T$ is the target class matrix and $H$ is the output matrix of the hidden layer [26–28].

The matrix $H$ is calculated from the hidden-layer activations, and the output weights β are obtained as the least-squares solution of this linear system [26–28].
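For reference, the standard random-hidden-layer formulation that matches the description above can be written as follows. This is a sketch, not necessarily the exact notation of [26–28]: the activation function $g$, the row vector $h(x)$, the number of training records $n$, and the pseudoinverse symbol $H^{\dagger}$ are assumptions introduced here.

```latex
\begin{align}
  f(x) &= \sum_{i=1}^{N} \beta_i \, g(w_i \cdot x + b_i) = h(x)\,\beta
        && \text{(network output, $N$ hidden neurons)} \\
  H &= \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_n) \end{bmatrix},
        \qquad H\beta = T
        && \text{(hidden-layer output matrix and learning problem)} \\
  \beta &= H^{\dagger} T
        && \text{(minimum-norm least-squares solution)}
\end{align}
```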

Although the algorithm works well in terms of accuracy, training time, and overall performance in classification problems, experiments (trial and error) show that it has some weaknesses. In particular, it relies solely on empirical risk minimization, which leaves it prone to overfitting (according to statistical learning theory, the true prediction risk is controlled by balancing empirical and structural risk); it offers limited control, since it directly calculates the minimum-norm solution with the GLSA method; and it may ultimately produce less reliable results in the presence of heteroscedasticity or outliers [26].

Therefore, to avoid the above problems, the proposed system uses a regularized [26] form of S-hFFNN that penalizes the coefficients of the output weight matrix while minimizing the output error. β can also be calculated through the Moore-Penrose generalized inverse [28].
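The regularized single-step training described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the `tanh` activation, the ridge form `(I/c + HᵀH)β = HᵀT`, the parameter names `n_hidden` and `c`, and the toy two-blob data are all assumptions chosen for illustration.

```python
import numpy as np

def train_elm(X, T, n_hidden=20, c=100.0, seed=0):
    """Fit a random-hidden-layer network; only beta is learned, in one solve."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights (never trained)
    b = rng.normal(size=n_hidden)                # random hidden biases (never trained)
    H = np.tanh(X @ W + b)                       # hidden-layer output matrix
    # Ridge-regularized least squares (assumed form): (I / c + H^T H) beta = H^T T
    beta = np.linalg.solve(np.eye(n_hidden) / c + H.T @ H, H.T @ T)
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Return the index of the highest-scoring class for each row of X."""
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)

# Toy data: two well-separated Gaussian blobs standing in for "normal" and "attack".
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
labels = np.repeat([0, 1], 100)
T = np.eye(2)[labels]                            # one-hot targets

W, b, beta = train_elm(X, T)
accuracy = float(np.mean(predict_elm(X, W, b, beta) == labels))
```

Because the hidden layer is fixed after random initialization, the only cost of training is one linear solve of size `n_hidden`, which is what makes the approach attractive for devices with limited resources.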

The solution of the above equation can be reduced to a generalized optimization problem in which the cost function is convex and the constraints are linear. The solution is obtained by the Lagrange multiplier method, which forms a Lagrangian function whose coefficients α are the Lagrange multipliers [29]. By this logic, the solution of the initial optimization problem is reduced to finding a saddle point of the Lagrangian: the function is maximized with respect to α and minimized with respect to the weights and b [30].

However, because the problem is not linearly separable due to uncertainty, representation inaccuracy, and noise, the purpose of the algorithm is to minimize the error. For this purpose, a new set of positive numbers (slack variables) is introduced, measuring the deviation of the data from the correct categorization and imposing penalties accordingly to regularize the algorithm [30, 31]. The decision surface and the corresponding optimization problem are modified to include these regularization terms, weighted by c, a positive constant that was calculated experimentally to normalize the output error.

The corresponding Lagrangian function then contains a new set of Lagrange multipliers in addition to α, and the optimization problem is restated accordingly [30, 31].

The set of optimal weights and the corresponding biases are calculated from the samples that satisfy the resulting optimality conditions [30, 31].
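The derivation outlined above follows the pattern of the classical soft-margin formulation, which can be sketched as below. The symbols $w$, $y_i$, $x_i$, $\xi_i$, and $\mu_i$ are assumptions introduced here for illustration, and the expressions are the textbook form rather than a verbatim copy of [29–31].

```latex
\begin{align}
  % Separable case: Lagrangian and saddle point
  L(w, b, \alpha) &= \tfrac{1}{2}\lVert w \rVert^2
      - \sum_i \alpha_i \bigl[\, y_i (w \cdot x_i + b) - 1 \,\bigr],
      \qquad \max_{\alpha \ge 0} \; \min_{w,\, b} \; L(w, b, \alpha) \\
  % Non-separable case: slack variables and penalty constant c
  \min_{w,\, b,\, \xi} \;& \tfrac{1}{2}\lVert w \rVert^2 + c \sum_i \xi_i
      \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \\
  L(w, b, \xi, \alpha, \mu) &= \tfrac{1}{2}\lVert w \rVert^2 + c \sum_i \xi_i
      - \sum_i \alpha_i \bigl[\, y_i (w \cdot x_i + b) - 1 + \xi_i \,\bigr]
      - \sum_i \mu_i \xi_i \\
  % Stationarity conditions
  w &= \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0, \qquad
      c - \alpha_i - \mu_i = 0 \;\Rightarrow\; 0 \le \alpha_i \le c
\end{align}
```

In this textbook form, the bias is recovered from the samples with $0 < \alpha_i < c$, for which the margin constraint holds with equality and $\xi_i = 0$.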

The main disadvantage of the above method, which uses full supervision, is that it requires many classified training examples to construct a prediction model with satisfactory accuracy [32]. This classification of the training set is usually done manually and is a laborious and time-consuming process. To overcome this problem, this work proposes a semisupervised neural system, where the training process uses as little preclassified data as possible. In general, unclassified data provide useful information for exploring the structure of the general dataset, while classified data drive the learning process. In particular, the process aims to learn a decision rule based on minimal predefined training data. Considering l, the labeled data on which the algorithm will be trained, and u, the unlabeled data that make up most of the general dataset, the process of the proposed online-semisupervised learning is described below [32–34]:

Step 1: The graph Laplacian is created from both Xl and Xu.
Step 2: A network is created with nh hidden neurons, random weights, and input biases, and its output is calculated.
Step 3: The stability parameter C is selected, which determines the degree of correlation of the prediction error between the different classes, together with the normalization parameter λ, which controls the relationship between achieving low error on the training data and the network weights.
Step 4: The output weights β are calculated in closed form, using one expression if nh ≤ N and an alternative expression if nh > N.
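For orientation, the two closed-form expressions commonly used for the output weights in graph-regularized semisupervised networks of this kind are sketched below. They are given under stated assumptions and may differ from the exact equations of [32–34]: $N = l + u$ is taken to be the total number of training records, $\tilde{C}$ a diagonal penalty matrix whose first $l$ entries hold the per-sample penalties and whose remaining $u$ entries are zero, $\tilde{Y}$ the target matrix padded with $u$ zero rows, and $\mathcal{L}$ the graph Laplacian of Step 1.

```latex
\begin{align}
  \beta &= \bigl( I_{n_h} + H^{\top} \tilde{C} H + \lambda\, H^{\top} \mathcal{L} H \bigr)^{-1}
           H^{\top} \tilde{C}\, \tilde{Y}
        && \text{if } n_h \le N \\
  \beta &= H^{\top} \bigl( I_{N} + \tilde{C} H H^{\top} + \lambda\, \mathcal{L} H H^{\top} \bigr)^{-1}
           \tilde{C}\, \tilde{Y}
        && \text{if } n_h > N
\end{align}
```

The case split only changes which matrix is inverted: an $n_h \times n_h$ matrix in the first case and an $N \times N$ matrix in the second, so the smaller of the two is always used.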

Extending the initial formulation of the problem of categorizing and detecting MQTT-based attacks, we find that this is a dynamic problem with a large amount of available data, of which few are labeled (for this reason, the semisupervised method was chosen). A key hypothesis that extends and strengthens the way of dealing with the problem is that, if the proposed algorithm can choose its training data on its own, then it will perform better. This logic leads to the implementation of an online learning system that overcomes the difficulties of data labeling through appropriate query mechanisms, which provide the true label of the most useful unlabeled records, creating predictive models of high accuracy from the best small training datasets [35, 36].

As part of the online learning process of the proposed system, a heuristic probabilistic mechanism for assessing the uncertainty of the data is proposed so that they can be labeled reliably. Accordingly, the records with the highest labeling uncertainty are sought, with the uncertainty calculated as the entropy of the posterior probabilities over all possible classes [37, 38]. The fundamental principle on which this strategy is based is the minimization of the hypothesis space, which corresponds to the total number of hypotheses that are consistent with the minimum amount of labeled data.
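Entropy-based selection of the most uncertain records can be sketched in a few lines of Python. The function names, the natural-log entropy, and the example posterior vectors are assumptions made for illustration; in the proposed system the posteriors would come from the trained network.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(posteriors, m=1):
    """Return the indices of the m records with the highest posterior entropy."""
    ranked = sorted(range(len(posteriors)),
                    key=lambda i: entropy(posteriors[i]),
                    reverse=True)
    return ranked[:m]

# Hypothetical posteriors for three unlabeled records over three classes.
posteriors = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # nearly uniform, high entropy
    [0.70, 0.20, 0.10],
]
picked = most_uncertain(posteriors, m=1)
```

The nearly uniform record is the one worth sending for labeling, since its label removes the most uncertainty from the current model.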

The detection of anomalies lies in the identification of patterns whose behavior differs substantially from the expected one, as expressed by the labeled data. The difference of a record, which essentially identifies the anomaly, is measured using vote entropy, which is computed from the number of votes each class receives [37, 39, 40].
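A minimal sketch of vote entropy follows. It assumes that the votes come from a small committee of models that each predict one label per record; the committee itself and the label strings are hypothetical and are not specified in the text.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Vote entropy of one record, given the label voted by each committee member."""
    k = len(votes)  # committee size
    return -sum((v / k) * math.log(v / k) for v in Counter(votes).values())

unanimous = vote_entropy(["normal", "normal", "normal", "normal"])
split = vote_entropy(["normal", "attack", "normal", "attack"])
```

A unanimous committee gives zero vote entropy, while an evenly split one gives the maximum value for that number of classes, which flags the record as a candidate anomaly.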

Additionally, to measure the differentiation and confirm the anomaly, the mean Kullback-Leibler divergence is considered, which treats as an anomaly the record with the largest mean difference between the class probabilities of each specific model and the consensus probability on the correctness of the label [41, 42].
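The mean Kullback-Leibler divergence to the consensus can be sketched as follows. The sketch assumes that each committee member outputs a full class-probability vector and that the consensus is the plain average of those vectors; both choices are assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kl_to_consensus(member_posteriors):
    """Average divergence of each member's posterior from the committee consensus."""
    k = len(member_posteriors)
    n_classes = len(member_posteriors[0])
    consensus = [sum(p[c] for p in member_posteriors) / k for c in range(n_classes)]
    return sum(kl_divergence(p, consensus) for p in member_posteriors) / k

agree = mean_kl_to_consensus([[0.5, 0.5], [0.5, 0.5]])
disagree = mean_kl_to_consensus([[0.9, 0.1], [0.1, 0.9]])
```

Records on which the members disagree strongly receive a larger score, so ranking by this value surfaces the records whose labels are most contested.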

In summary, the proposed online-semisupervised neural anomaly detection system initially uses the semisupervised regularized S-hFFNN algorithm C, which is trained on the set of labeled data L, resulting in the model h. Then, based on the labeled data L and according to the selected online learning strategy q, new labeled data m are created from the general dataset U and integrated into the set L, so that L grows with every iteration.

The algorithmic approach of the system in question is described as follows in Algorithm 1.

(1) Input: L, labeled data; U, unlabeled data; C, regularized S-hFFNN; q, active learning strategy; m, new labeled data; maxIter, max iterations
(2) h ← train(C, L)
(3) for i from 1 to maxIter
(4)   choose m from x ∈ U based on strategy q
(5)   obtain the labels of m
(6)   L ← L ∪ m
(7)   U ← U \ m
(8)   h ← train(C, L)
(9) end for
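The loop of Algorithm 1 can be illustrated with a small runnable sketch. Everything concrete in it is an assumption: a one-dimensional nearest-centroid model stands in for the regularized S-hFFNN, posterior entropy plays the role of the strategy q, and the `oracle` function simulates the source of true labels.

```python
import math

def train_centroids(labeled):
    """Stand-in for the regularized S-hFFNN: one centroid per class (1-D data)."""
    sums = {}
    for x, y in labeled:
        s, n = sums.get(y, (0.0, 0))
        sums[y] = (s + x, n + 1)
    return {y: s / n for y, (s, n) in sums.items()}

def posterior(model, x):
    """Class probabilities from a softmax over negative centroid distances."""
    scores = {y: math.exp(-abs(x - c)) for y, c in model.items()}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def entropy(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def active_learning(L, U, oracle, m=1, max_iter=3):
    L, U = list(L), list(U)
    h = train_centroids(L)                       # train on the initial labels
    for _ in range(max_iter):
        if not U:
            break
        # Strategy q: rank the pool by posterior entropy, most uncertain first.
        U.sort(key=lambda x: entropy(posterior(h, x)), reverse=True)
        chosen, U = U[:m], U[m:]                 # pick m records, remove them from U
        L += [(x, oracle(x)) for x in chosen]    # obtain labels and add them to L
        h = train_centroids(L)                   # retrain on the enlarged set
    return h, L, U

def oracle(x):
    """Hypothetical source of true labels."""
    return "attack" if x > 5.0 else "normal"

L0 = [(0.0, "normal"), (10.0, "attack")]
U0 = [1.0, 4.9, 9.0, 5.2]
h, L_out, U_out = active_learning(L0, U0, oracle, m=1, max_iter=2)
```

In this toy run the two records closest to the decision boundary are queried first, while the records that the model already classifies confidently stay in the unlabeled pool.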

4. Experiments

The proposed work aims to create a digital security system linked to the IoT/IIoT and in particular to the MQTT communication protocol to give the research and industrial community a fully realistic framework for its use and implementation. The most relevant dataset that simulates communication and transaction modes in IoT/IIoT, as well as the associated MQTT-based attacks [12, 13], was chosen for the most accurate and realistic picture of how M2M communication works. Specifically, the selected dataset came from the recording of IoT network sensor data based on the application of the MQTT protocol, as applied in real automation conditions in a smart home environment.

Specifically, data on normal and malicious network traffic were collected from 10 different sensors, which communicate at different times by exchanging information about temperature, light intensity, humidity, motion detection, CO gas, smoke, fan controller, door lock, and fan sensor. Each sensor behaves differently according to its characteristics. The sensors are located in two different rooms, each has a dedicated IP address, and their communication port is 1883, while the communication time is periodic or random depending on the type of sensor (e.g., the temperature sensor is periodic, while the motion detector is triggered by an event, so it sends its information at random times). Eclipse Mosquitto is used as the MQTT message broker. Each sensor is associated with a topic defined by the sensor when sending data to the broker. Table 1 presents in detail the MQTT sensors with the corresponding information that characterizes them, specifically IP address, room, time, and topic [18].

MQTT traffic is captured in a Packet CAPture (PCAP) file, which is logged as part of the MQTTset data production process. The capture covers one week (from Friday at 11:40 to Friday at 11:45). The dataset is open to the public and consists of 11,915,716 network packets totaling 1,093,676,216 bytes [18].

At the application level, MQTT operates over the TCP/IP protocol. The exchange of messages takes place between the publisher or subscriber and the broker. Any device connected to the broker can act as both a subscriber and a publisher. The publisher sends the information he wants to share to the broker, defining a specific topic in the message. MQTT devices use specific types of messages to communicate with, such as connect (connection creation with broker), disconnect (termination of connection with broker), publish (publish of data related with a topic), subscribe (subscription to a topic), and unsubscribe (delete from a topic). Those subscribers who are connected to the broker will receive the information using the specific topic. The topics are UTF-8 encoded characters and have a tree-shaped format, thus facilitating the organization and access to data [10, 12].
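The tree-shaped topic format is what allows a subscriber to address whole branches of the topic tree. As a minimal sketch, the function below matches a topic filter against a concrete topic name using the two MQTT wildcards, `+` (one level) and `#` (all remaining levels). The topic names are hypothetical, and the sketch deliberately ignores the special handling of topics that begin with `$`.

```python
def topic_matches(topic_filter, topic):
    """Return True if an MQTT topic filter matches a concrete topic name."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: valid only as the last level, matches the rest.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)
```

For example, a subscription to `home/+/temperature` would receive the temperature readings of every room, while `home/#` would receive all traffic below `home`.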

Respectively, MQTT messages consist of a fixed header (present in all messages), a variable header (present in some messages), and a payload (present in some messages). The fixed header includes the message type (e.g., connect, subscribe, publish, etc.), flags specific to each MQTT packet type (auxiliary flags, whose presence and status depend on the message type), and the remaining length. The four most significant bits of the first byte of the fixed header encode the message type, and the four least significant bits carry the flags [10]. A schematic representation of the MQTT packets is shown in Figure 2 [10].
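The fixed header layout can be made concrete with a short decoder. This is a sketch assuming MQTT 3.1.1 packet type codes and the variable-length encoding of the remaining length field (seven data bits per byte, high bit as continuation flag, at most four bytes); the function name and return shape are choices made for this example.

```python
PACKET_TYPES = {
    1: "CONNECT", 2: "CONNACK", 3: "PUBLISH", 4: "PUBACK", 5: "PUBREC",
    6: "PUBREL", 7: "PUBCOMP", 8: "SUBSCRIBE", 9: "SUBACK", 10: "UNSUBSCRIBE",
    11: "UNSUBACK", 12: "PINGREQ", 13: "PINGRESP", 14: "DISCONNECT",
}

def parse_fixed_header(data):
    """Decode an MQTT fixed header.

    Returns (type name, flags, remaining length, header size in bytes).
    """
    first = data[0]
    name = PACKET_TYPES.get(first >> 4, "RESERVED")  # four most significant bits
    flags = first & 0x0F                             # four least significant bits
    remaining, multiplier = 0, 1
    for i in range(1, 5):                            # length field uses 1 to 4 bytes
        byte = data[i]
        remaining += (byte & 0x7F) * multiplier      # low 7 bits carry the value
        if not byte & 0x80:                          # high bit clear: last length byte
            return name, flags, remaining, i + 1
        multiplier *= 128
    raise ValueError("malformed remaining length field")

# A PUBLISH packet with flags 0b0010 (QoS 1) and a remaining length of 321 bytes.
header = parse_fixed_header(bytes([0x32, 0xC1, 0x02]))
```

Fields such as the message type and the flags decoded here are the kind of per-packet information that a traffic analyzer exposes as dataset features.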

Respectively, the variable header, when present, contains the data shown in Figure 3 [10].

The payload and the format of the data transmitted via MQTT messages are defined by the application, while the size of the payload can be calculated by subtracting the length of the variable header from the remaining length given in the fixed header.
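As a worked example of this subtraction, the sketch below computes the payload size of a PUBLISH packet. It assumes the MQTT 3.1.1 layout, in which the variable header of a PUBLISH packet consists of a 2-byte length prefix, the topic name, and a 2-byte packet identifier that is present only for QoS 1 and 2; the function name and the sample numbers are hypothetical.

```python
def publish_payload_length(remaining_length, topic_length, qos):
    """Payload size in bytes of an MQTT 3.1.1 PUBLISH packet."""
    variable_header = 2 + topic_length   # 2-byte topic length prefix + topic name
    if qos > 0:
        variable_header += 2             # packet identifier, only for QoS 1 and 2
    payload = remaining_length - variable_header
    if payload < 0:
        raise ValueError("remaining length is shorter than the variable header")
    return payload

# Remaining length 321, a 20-byte topic name, QoS 1: 321 - (2 + 20 + 2) = 297 bytes.
size = publish_payload_length(321, 20, 1)
```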

In the dataset that was selected to be used in this study, there are 33 features, and the Class (target) includes Flooding DoS, MQTT Publish Flood, SlowITe, Malformed Data, and Brute-Force Attack. The total information was analyzed by Wireshark and allows us to understand the workflow associated with MQTT communication and is presented in Table 2 [18].

To carry out a more thorough analysis that allows the proposed intelligent system to perform better without compromising its predictive capacity, the initial dataset was reduced based on an evaluation of the information provided by each feature [10, 12, 42]. This stage is critical, since the system becomes simpler and more efficient once only the features that provide meaningful information are kept.

On closer inspection, some features that come from closely related fields of the headers in the MQTT packet structure provide insignificant information that can be omitted. For example, mqtt.conflag.qos refers to the Quality of Service (QoS) level, which allows the client to choose a service level that matches the reliability of the network and its application logic. Because MQTT manages message retransmission and guarantees delivery (even when the underlying transport is not reliable), QoS makes communication on unreliable networks much more reliable. However, in the case under consideration, this information can be omitted, since data loss at this level is acceptable for the low-priority network traffic of smart home projects. Similarly, the mqtt.will-xxx information concerns planned or unexpected network disconnections caused by, for example, connection loss or power loss; this information does not contribute to the evaluation of the system because, as mentioned above, this is a low-priority household network. Thus, knowing the configuration and subfunctions of the MQTT packet structure, it is possible to pinpoint the information that needs to be evaluated to accurately identify cyber attacks. After this heuristic reduction of the original dataset, the 10 features presented in Table 3 were removed (highlighted in strikethrough text format) [10, 12, 42].

Respectively, feature importance was computed with the Decision Trees method. In this process, the set T includes data that belong to more than one category, and the aim is to divide T into subsets whose data all belong to a single category. Specifically, we select an appropriate test, which typically uses a single attribute and yields a single result from a finite set of possible outcomes. In this way, set T is separated into subsets, one per outcome, where each subset contains all the data of T for which that outcome was obtained. In conclusion, the Decision Tree includes (a) a decision node where the selected test is performed and (b) a branch for each result [43–45].
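The way a single-attribute test is scored can be illustrated with information gain, the entropy reduction obtained when T is split into one subset per outcome. The text does not state which split criterion was used, so information gain is an assumption here (Gini impurity is an equally common choice), and the feature names and rows are toy values.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction when T is split into one subset per value of `feature`."""
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(y)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Toy records: "msgtype" separates the classes perfectly, "qos" not at all.
rows = [
    {"msgtype": "publish", "qos": 0},
    {"msgtype": "publish", "qos": 1},
    {"msgtype": "connect", "qos": 0},
    {"msgtype": "connect", "qos": 1},
]
labels = ["legitimate", "legitimate", "attack", "attack"]
gains = {f: information_gain(rows, labels, f) for f in ("msgtype", "qos")}
```

A feature whose test produces pure subsets gets the full gain, while a feature that leaves every subset as mixed as T gets zero and is a candidate for removal.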

The final dataset is presented in Table 4 (removed features were highlighted by strikethrough text format).

It should also be noted that, in this particular dataset, which is divided into 70% training (12,080,355 instances) and 30% test (3,624,106 instances), only 17% of the training dataset labels (2,053,660 instances) were used by the proposed online-semisupervised system. The results of the proposed categorization process on the final dataset, together with those of the comparative methods of anomaly identification and categorization, are presented in Table 5 [46, 47].

As shown by the results table, the proposed regularized S-hFFNN algorithm works efficiently and very quickly, surpassing the corresponding competing algorithms [46]. Also, in addition to achieving the smallest error, the proposed algorithm achieves the best generalization, which is attributed to the smaller norm of the output weights, as a smaller norm is directly related to the generalization and stability of the model. This algorithm also overcomes various difficulties encountered by traditional algorithms, such as overfitting and entrapment in local minima. This finding is reinforced by regularization, which normalizes the categorization process and ensures high results even when many neurons are used in the hidden layer.

In general, this observation suggests that the proposed system can perform efficiently for any differentiable or nonlinear activation function. Also, for the hidden nodes of the proposed network, the activation function can be any bounded, nonconstant, or piecewise continuous function without this being a problem for the approximation. In contrast to competing methods, it has been shown that the parameters of each network are better chosen at random than by spending valuable time deciding what the initial values should be and incurring delays from their recalculation in each iteration. Finally, it is important to note that the proposed regularized S-hFFNN, to which hidden nodes can be added at random, acts as a universal approximator, reinforcing the idea of building incrementally growing feedforward networks without tuning problems or delays.

Respectively, the semisupervised technique, and especially the online learning methodology, significantly improves the handling of the problem of limited label availability. This is particularly valuable in complex digital security problems, where new attacks appear regularly but in most cases come from a marginally correlated distribution. Also, in the effort to create a realistic operating environment, the proposed algorithm can work in settings with limited resources thanks to its short execution times, to which the feature selection process also contributed. In general, the very high results achieved, in combination with the general methodology that simplifies and automates the procedures for detecting anomalies in MQTT networks [17, 22], make a strong case for the use of the proposed system.

5. Conclusions

Industrial infrastructures are exposed to new risks due to the vulnerabilities of communication and information technology, which is significantly enhanced by the existing heterogeneity that usually characterizes these systems. In this spirit, and to avoid cybercriminals gaining access to the manufacturing process, which could have serious and possibly irreversible consequences, most industrial companies seek high-performance security solutions to mitigate risks, protect infrastructure, and ensure the privacy of their data.

The innovation and reliability offered by machine learning technologies and, as evidenced by this study, the advanced online-semisupervised learning methods significantly enhance the ways of dealing with modern cyber attacks [48, 49] against industry standards and applications. Especially for protocols that are not completely secure but at the same time very popular, such as MQTT, which was analyzed in this paper, there is a serious legacy of intelligent ways to deal with similar problems.

In conclusion, the most significant innovation of the proposed online-semisupervised neural anomaly detector for identifying MQTT-based attacks in real time [50, 51] lies in the fact that the learning algorithm actively participates in the acquisition of knowledge by selecting the unlabeled data itself, thus minimizing the time, cost, effort, and resources required when tagging unknown data [52, 53]. Extending this observation, and knowing the labels of some of the samples, we know for each sample which other samples belong to the same class.

The distance between a sample and its nearest neighbors of the same class can then be determined. If the distance is large, we can deduce that the sample is an outlier or extreme value. If the distance is minimal, the sample is more likely to be classified correctly by the proposed classifier. So even in cases where the algorithm fails to categorize a sample correctly, we can move the sample toward its nearest neighbors and thus make the classifier more robust to the noise contained in the environment in question. This idea can be further strengthened by adding new features to the system that extend it in this direction.
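The same-class neighbor distance described here can be sketched as follows. The Euclidean metric, the number of neighbors `k`, and the toy coordinates are assumptions made for illustration, and the caller is expected to exclude the sample itself from the labeled set.

```python
import math

def same_class_nn_distance(sample, label, labeled, k=1):
    """Mean Euclidean distance from `sample` to its k nearest labeled
    neighbors that share the class `label`."""
    distances = sorted(math.dist(sample, x) for x, y in labeled if y == label)
    if not distances:
        raise ValueError("no labeled samples of this class")
    nearest = distances[:k]
    return sum(nearest) / len(nearest)

# Hypothetical labeled records in a two-feature space.
labeled = [((0.0, 0.0), "normal"), ((0.0, 1.0), "normal"), ((5.0, 5.0), "attack")]

typical = same_class_nn_distance((0.0, 0.5), "normal", labeled, k=2)
suspect = same_class_nn_distance((3.0, 4.0), "normal", labeled, k=1)
```

A sample whose distance is far above that of typical members of its class, as in the second call, would be treated as an outlier or as a candidate for relabeling.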

Finally, given the flexibility and at the same time the simple structure of the proposed regularized S-hFFNN neural system, the system proved to be particularly robust in a completely uncertain and noisy environment [54, 55], creating serious expectations for its further use in an industrial environment, which is also the main future research effort towards its evolution.

Data Availability

The dataset is freely available in the Kaggle repository (https://www.kaggle.com/cnrieiit/mqttset).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.