Abstract

The Lanzhou-Xinjiang (Lan-Xin) high-speed railway is one of the principal sections of the railway network in western China, and signal equipment is of great importance in ensuring its safe and efficient operation. Over a long period of railway operation and maintenance, the railway signaling and communications department has recorded a large amount of unstructured text information about equipment faults in the form of natural language. However, because these records are kept in inconsistent ways, the data are difficult to use directly. In this paper, a method based on natural language processing (NLP) was adopted to analyze and classify this information. First, the Latent Dirichlet Allocation (LDA) topic model was used to extract the semantic features of the text, which were then expressed in the corresponding topic feature space. Next, the Support Vector Machine (SVM) algorithm was used to construct a signal equipment fault diagnostic model that reduced the impact of sample data imbalance on the classification accuracy. The model was compared with the traditional Naive Bayes (NB), Logistic Regression (LR), Random Forest (RF), and K-Nearest Neighbor (KNN) algorithms. This study used signal equipment failure text data from the Lan-Xin high-speed railway for experimental analysis to verify the effectiveness of the proposed method. Experiments showed that the accuracy of the SVM classification algorithm could reach 0.84 when combined with the LDA topic model, which verifies that the natural language processing method can effectively realize the fault diagnosis of signal equipment and provides useful guidance for the maintenance of field signal equipment.

1. Introduction

Railway signal equipment mainly includes railway signals, station interlocking equipment, and section blocking equipment. The main function of this equipment is to ensure the safety of train operation and shunting work and to increase the capacity of the railway. It also plays an important role in increasing the economic benefits of railway transportation and improving the working conditions of railway staff. With the rapid development of information technology and the upgrading of railway signal equipment, the railway bureau has accumulated a large amount of signal equipment failure data. However, the equipment faults in these failure data are mostly described in unstructured natural language. As a result, when dealing with equipment failures, maintenance personnel still rely on a combination of expert knowledge and personal experience to diagnose the problems. This approach makes accurate fault diagnosis challenging for on-site technicians, who may lack experience or have an incomplete understanding of on-site conditions, and the resulting delays in handling faults also create significant safety hazards. Therefore, in the era of big data, applying machine learning and natural language processing methods to diagnose railway signal equipment faults can reduce the technical demands on field maintenance staff. It is of great significance for improving the efficiency of railway signal equipment fault diagnosis and ensuring the safe and efficient operation of the railways.

At present, the fault records of railway signal equipment in China consist of unstructured Chinese short texts written in natural language. They contain a large amount of specialized vocabulary mixed with numbers, letters, and special symbols, which makes word segmentation and feature extraction difficult. To date, there has been little research on fault text mining for railway signal equipment either in China or abroad. Zhao and Xun [1] used text mining to diagnose faults in on-board equipment, with a traditional Bayesian network as the fault classification method. Yang et al. [2] used the Synthetic Minority Oversampling Technique (SMOTE) algorithm to intelligently classify unbalanced text data, which alleviated the sample imbalance problem to a certain extent. Zhong, Tang, and Wang [3] studied the feature extraction and diagnosis of turnout faults, and ShangGuan et al. [4] used a topic model to perform fault diagnosis on vehicle equipment. Because fault records are not kept consistently across stations and railway sections, the subsequent counting and categorization of faults is affected. Some scholars have proposed [5, 6] solving the problem of inconsistent maintenance record standards among on-site personnel through standardized documents, which to a certain extent promotes the standardized management of railway big data records. However, there have been no studies on how to effectively analyze and interpret such a large amount of recorded data.

In this paper, the topic model method was used to extract the semantic features of the fault text recorded by the railway signaling and communications department. A topic model is a statistical model that uncovers the latent semantic structure of a text collection through unsupervised learning. In reference [7], the Latent Dirichlet Allocation (LDA) model was used to mine the topics of conference summary texts, with different topics being clustered. Peng et al. [8] used the LDA topic model to extract product features from social media reviews and studied the relationship between different reviews and customer sentiment. References [9–11] introduced topic model methods for text classification tasks and all achieved satisfactory results. Additionally, the topic model can effectively extract the semantic information of text and discover the latent correlations between documents and the vocabulary in the text [12].

Based on an analysis of the features of the fault record text, this study used machine learning and natural language processing algorithms together with the LDA topic model to extract the word item features and topic features of the corresponding faults from the fault description text of railway signal equipment. Each fault document was transformed into the topic feature space, which effectively reduced the feature dimension and made the reduced-dimension data easier to process and use. Because the recorded fault data were unevenly distributed across categories, the Support Vector Machine (SVM) classifier was selected to classify the faults. The SVM classifier is not sensitive to unbalanced data and is recognized as one of the most effective models for small data samples.

2. Fault Text Analysis of Railway Signal Equipment

Railway signal equipment mainly includes dispatching centralized traffic control (CTC) system equipment, train dispatching command system (TDCS) equipment, train control equipment, interlocking equipment, switch equipment, track circuits, signal machines, snow melting equipment, power supply equipment, etc. There are also many ways to classify railway signal equipment faults. In this paper, according to the function of the corresponding equipment and the mode of the fault phenomenon, the faults are divided into 10 categories: switch equipment fault, track circuit fault, signal machine fault, snow melting equipment fault, train control equipment fault, CTC equipment fault, TDCS equipment fault, microcomputer interlocking fault, power supply equipment fault, and other faults. To verify the correctness and effectiveness of the model, 1000 fault text data samples recorded by the Lan-Xin high-speed railway signaling and communications department were selected for experimental analysis. Figure 1 shows the distribution of signal equipment failures recorded by the department from 2017 to 2019.

2.1. Fault Record Data of Signal Equipment

In this study, the fault record data provided by the Lan-Xin high-speed railway signaling and communications department were screened manually. Some fault phenomena that were not recorded in detail, fault diagnosis results that were recorded incorrectly, or unrecorded fault data were all manually removed, making the selected data more conducive to follow-up feature extraction and classification work. Table 1 shows some examples of fault records.

It can be seen from Table 1 that there is a large amount of fault feature information in both the fault overview and cause analysis columns. However, due to differences in personal language habits and maintenance experience of field maintenance personnel, the data forms recorded in natural language have certain differences, which makes the subsequent fault feature extraction and fault diagnosis algorithm more complex and difficult.

2.2. Signal Equipment Fault Diagnosis Process

After the fault is discovered, the on-site maintenance personnel usually take photos of the location where the fault is found and make a detailed text record of the situation when the fault occurs, and then use a combination of personal experience and expert knowledge to diagnose the type of the fault. Figure 2 is a photo of part of the fault location on-site.

Figure 3 shows the signal equipment fault diagnosis process, which mainly includes fault word item feature extraction, fault topic feature extraction (implied semantic features), and fault diagnosis. Unlike English text, in which words are separated by spaces, Chinese text must first be segmented into words before the word item features can be extracted with the bag-of-words (BoW) model. The high dimensionality of the fault records, together with the loss of semantic information caused by large differences in how faults are recorded, poses great challenges for fault diagnosis. Therefore, this study adopted the topic model method to extract fault features, reduce the dimension of the data, and express each fault document in the topic feature space. Finally, the support vector machine was used as the fault diagnosis model to diagnose the signal equipment faults.

3. Fault Text Segmentation and Fault Feature Extraction of Railway Signal Equipment

Proper data preprocessing can effectively reduce the complexity of the signal equipment fault diagnosis algorithm and improve the accuracy of the diagnosis. In the field of natural language processing, the traditional approach expresses the original text data in a vector space model (VSM) of word items, in which the dimension of the vector equals the number of words. This model ignores word order, grammar, and syntax, loses part of the semantic meaning, and only counts the number of times each term appears in each document. For the railway signal equipment fault text in this paper, the original features are the fault word features. However, using the bag-of-words model to express signal equipment faults ignores the relationship between the fault word features and the topics, which reduces the accuracy of the subsequent fault diagnosis.

In recent years, statistical models for discovering hidden topics in collections of documents have been widely used in many fields, such as user evaluation, social media, sentiment classification, medical evaluation, and user preferences [13–17]. Therefore, to address the shortcomings of the traditional bag-of-words model and improve the degree of automation of feature extraction and its adaptability for later fault diagnosis, this study combined Chinese text word segmentation [18, 19] with topic models [20, 21] to extract topics after word segmentation. The steps of word segmentation and feature extraction are shown in Figure 4.

3.1. Generation of Fault Lexicon

Since the original fault text of signal equipment is recorded in the form of natural language, it is necessary to use Chinese word segmentation technology to process the original document. The core process of the Chinese word segmentation algorithm is to input the Chinese text to be segmented, then identify the stop vocabulary to remove the useless words in the document, use the user-defined dictionary to correctly segment specific vocabulary, and finally output the optimal word segmentation result. The flow chart of the Chinese word segmentation algorithm is shown in Figure 5.

In this study, the Jieba text segmentation package, written in the Python language, was used to segment the signal fault text. The Jieba algorithm uses a prefix dictionary to achieve efficient word graph scanning and generates a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in a sentence. It then uses dynamic programming to find the maximum-probability path and, based on word frequency, finds the best segmentation combination. For unknown words, a Hidden Markov Model (HMM) based on the word-forming ability of Chinese characters is adopted together with the Viterbi algorithm. The general vocabulary of the Chinese word segmentation tool does not contain the professional vocabulary of the railway signal field. Therefore, using it directly to segment the data in this paper would produce incorrect or unsegmented words, making the segmentation results unsatisfactory and affecting the subsequent feature extraction.

As a result, a lexicon related to railway signal equipment faults needed to be built manually. In principle, all the word items in the documents could be used for training, but some of them had no essential connection with the faults. For example, station names, staff member names, words such as “and” and “or,” and punctuation marks needed to be removed as stop words. At the same time, some terms in the railway signal field, such as red light band, point machine, manual release, and block, which have important relationships with the fault types, did not exist in the general-domain dictionary. In the word segmentation process, it was therefore necessary to establish a user-defined vocabulary list for the railway signal field according to expert knowledge, and the corresponding words were added to this list so that the segmentation results would be more accurate. Figure 6 is a sample of the vocabulary in the railway signal field. After analysis and summary, a total of 98 related terms were collected.
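
As an illustration, a minimal version of this segmentation step with Jieba might look like the following sketch; the dictionary and stop-word file names and the sample record are hypothetical, not the actual files used in this study.

```python
# A minimal segmentation sketch, assuming a user-defined railway vocabulary file
# (signal_dict.txt) and a stop-word list (stopwords.txt); both file names are illustrative.
import jieba

jieba.load_userdict("signal_dict.txt")   # one railway term per line, e.g. 红光带, 转辙机

with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = set(line.strip() for line in f)   # station names, staff names, punctuation, etc.

def segment(record):
    """Segment one fault record and drop stop words and empty tokens."""
    return [t for t in jieba.lcut(record) if t.strip() and t not in stopwords]

# Illustrative fault record (not taken from the actual data set)
fault_records = ["道岔失去表示，现场检查转辙机缺口偏移"]
corpus_tokens = [segment(doc) for doc in fault_records]
```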

After the Chinese word segmentation was performed on the signal fault documents, the corresponding railway signal fault dictionary could be obtained. Afterward, the unstructured text was represented as a structured vector through the VSM model. The generation of the term-document matrix is shown in Figure 7.
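
Continuing the sketch above, the term-document matrix could be built with a bag-of-words vectorizer; the specific tool (scikit-learn's CountVectorizer) is an assumption for illustration rather than the authors' stated implementation.

```python
# Build the term-document matrix (VSM / bag-of-words) from the segmented records.
from sklearn.feature_extraction.text import CountVectorizer

segmented_docs = [" ".join(tokens) for tokens in corpus_tokens]   # rejoin tokens with spaces

# Treat every whitespace-separated token as a term, so the segmented Chinese words are kept intact.
vectorizer = CountVectorizer(token_pattern=r"(?u)\S+")
term_doc_matrix = vectorizer.fit_transform(segmented_docs)        # shape: (n_documents, n_terms)
```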

3.2. Fault Feature Extraction of Railway Signal Equipment Based on the LDA Topic Model

The dimension of the traditional word feature space grows with the size of the vocabulary, so the resulting vectors are quite sparse. In addition, the word feature space cannot deal with polysemy (one word with several meanings) or synonymy (several words with one meaning), which greatly increases the complexity of subsequent fault diagnosis. To address these issues, topic model methods have developed rapidly in recent years [22–24]. The topic model represents text from the perspective of a probabilistic generative model: each dimension is a “topic,” which is usually a cluster of words, so the semantics represented by each dimension can be inferred from the topic. The representation is interpretable and transforms the document from the lexical feature space to the topic feature space, overcoming the shortcomings of the traditional lexical feature space. This study used the LDA topic model algorithm to extract the features of the railway signal equipment fault records.

3.2.1. LDA Model

The LDA topic model is a Bayesian extension of the Probabilistic Latent Semantic Analysis (PLSA) algorithm. It is composed of a three-layer structure of documents, topics, and words. It is an unsupervised machine learning technique and is often used in text topic recognition, text classification, and text similarity calculation. It assumes that a document contains multiple topics and that each topic corresponds to different words. The process of constructing a document is to first select a topic with a certain probability and then select a word under this topic with a certain probability, which generates the first word of the document; repeating this process continuously generates the whole document. Both the document-to-topic and the topic-to-word distributions are multinomial. The probabilistic graphical model of LDA is shown in Figure 8.

In Figure 8, M represents the total number of documents included in the training, each document contains N words, and K represents the number of topics; θ_m is the topic distribution of document m, and φ_k is the word distribution of topic k. Dirichlet(α) is the prior distribution of the parameter θ, from which the LDA model samples the topic distribution corresponding to each document, and Dirichlet(β) is the prior distribution of the parameter φ, from which the word distribution corresponding to each topic is sampled [25]. The relationships of all the variables in the model are captured by the joint distribution:

p(w, z, θ, φ | α, β) = ∏_{k=1}^{K} p(φ_k | β) ∏_{m=1}^{M} p(θ_m | α) ∏_{n=1}^{N} p(z_{m,n} | θ_m) p(w_{m,n} | φ_{z_{m,n}})

Using this joint probability distribution, the conditional distribution of the hidden variables given the observed variables can be calculated [26]:

p(z, θ, φ | w, α, β) = p(w, z, θ, φ | α, β) / p(w | α, β)

When analyzing the original data, the entire document set is used as the input content for LDA training to obtain the topic distribution of each record.
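
A minimal sketch of this training step is shown below; it assumes scikit-learn's LatentDirichletAllocation and the term_doc_matrix from the sketch in Section 3.1, so the library choice is illustrative rather than the authors' stated implementation.

```python
# Fit LDA on the term-document matrix and obtain each record's topic distribution.
from sklearn.decomposition import LatentDirichletAllocation

K = 17   # number of topics; how K is chosen is discussed in Section 3.2.2
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topic_matrix = lda.fit_transform(term_doc_matrix)   # shape: (n_documents, K)
# Each row is the topic distribution of one fault record and serves as its feature vector.
```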

3.2.2. Feature Extraction of Railway Signal Equipment Fault Subject

When the topic model is applied to extract the topic features of railway signal equipment faults and transform them into the topic-document matrix, the number of topics K must be specified. In general, the value of K has to be given before training the model, and there is no fixed method for determining it, so a certain degree of subjective experience is needed. The simplest way to determine the optimal K is to repeat the experiment with different K values; when the evaluation function (such as the precision of the classifier) reaches its optimum, the corresponding K value is considered optimal. If K is too small, the classification precision will be relatively low; if K is too large, the purpose of denoising will not be achieved. Figure 9 shows how the precision of the SVM classifier changes with the number of topics. In 20 experiments, the precision rate exceeded 80% twice: 0.801 at K = 10 and 0.840 at K = 17. Therefore, K = 17 was selected as the optimal number of topics.
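
The repeated-experiment search for K can be sketched as follows; the candidate range, the macro-averaged precision, and the fault_labels variable (the 10 fault category labels) are assumptions for illustration.

```python
# Sweep candidate values of K and keep the one with the best SVM precision on held-out data.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

best_k, best_precision = None, 0.0
for k in range(1, 21):   # 20 candidate topic numbers (illustrative range)
    topics = LatentDirichletAllocation(n_components=k, random_state=0).fit_transform(term_doc_matrix)
    X_tr, X_te, y_tr, y_te = train_test_split(topics, fault_labels, test_size=0.3, random_state=0)
    clf = SVC(kernel="rbf").fit(X_tr, y_tr)
    p = precision_score(y_te, clf.predict(X_te), average="macro")
    if p > best_precision:
        best_k, best_precision = k, p
print(best_k, best_precision)
```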

After the value of the number of topics K was determined, the LDA topic model was used to perform dimensionality reduction and feature extraction processing on the term-document matrix. After dimensionality reduction of the term-document matrix, the components in the topic space had semantic meaning and corresponding characteristics. Table 2 shows the results of fault records for signal equipment of the Lanzhou-Xinjiang high-speed railway after feature extraction using the LDA topic model.

It can be seen from Table 2 that Topic T1 is related to the fault of “loss of indication of a switch,” Topic T2 represents a “switch idling” fault, Topic T3 is related to a “transponder fault,” and Topic T4 corresponds to a fault in “wireless communication.” Figure 10 illustrates the expression process of the fault document in the topic feature space.

4. Fault Diagnosis of Railway Signal Equipment Using the Support Vector Machine

The support vector machine (SVM) is a widely used classifier method in classification and regression analysis and has achieved satisfactory results in many applications [2729]. There are many kinds of faults in railway signal equipment, the distribution of the various faults is unbalanced, and the sample number of some faults is very limited. Support vector machines are advantageous when dealing with small sample data sizes and are less sensitive to unbalanced data. The SVM can rely on a small number of support vectors to perform the corresponding diagnosis. Therefore, this study used the learning algorithm of the support vector machine to diagnose the faults of railway signal equipment.

4.1. Description of the SVM Algorithm

The SVM method is based on the Vapnik-Chervonenkis (VC) dimension theory of statistical learning and the principle of structural risk minimization. It seeks the best compromise between model complexity and learning ability from limited sample information in order to obtain the best generalization ability. The SVM maps instances to points in space and separates instances of different categories by as wide a margin as possible. The SVM can handle not only linear classification but also nonlinear problems by introducing kernel functions. Taking binary classification as an example, the classification diagram is shown in Figure 11.

Here, the classification decision function is

f(x) = sign(w · x + b)

To improve the classification effect, the optimal separating hyperplane can be obtained by maximizing the classification margin.

Equation (4) can be solved by introducing Lagrangian duality.

In practice, some data points will deviate from the decision plane, so slack variables are introduced. Equation (5) can therefore be transformed into equation (6), where C is the penalty factor used to constrain the slack variables.
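
For reference, the standard soft-margin objective with slack variables ξ_i and penalty factor C can be written as follows (a standard form given here for orientation; the paper's own numbered equations are not reproduced):

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_{i}
\quad \text{s.t.} \quad y_{i}\,(w \cdot x_{i} + b) \ge 1 - \xi_{i},\quad \xi_{i} \ge 0,\quad i = 1, \dots, n
```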

The partial derivative of equation (6) can be found and then rearranged.

To obtain the Lagrange multiplier vector a, the sequential minimal optimization (SMO) algorithm [30] is introduced. Once a is known, the weight vector w can be obtained, and b can then be computed according to equation (9).

By mapping the data points into a high-dimensional space, linearly inseparable cases can be handled and the data can be classified. However, nonlinear mapping into a high-dimensional space greatly increases the amount of computation, so introducing the concept of a kernel function into the support vector machine can effectively reduce this cost. In this study, the Gaussian kernel function was used to implicitly map the input into a high-dimensional space, making the classifier more effective for nonlinear classification.
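
The Gaussian (radial basis function) kernel is commonly written as:

```latex
K(x_{i}, x_{j}) = \exp\!\left(-\frac{\lVert x_{i} - x_{j} \rVert^{2}}{2\sigma^{2}}\right)
```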

The above mainly describes the support vector machine for binary classification, but in practice most problems involve more than two classes. At present, there are two main ways to realize multi-class classification with an SVM: the one-vs-rest method (OVR SVM) and the one-vs-one method (OVO SVM).

4.2. Fault Diagnosis of Railway Signal Equipment Based on the SVM Algorithm

In this study, the one-vs-one multi-class method (OVO SVM) was used to train the model. The topic-document matrix and the fault category labels corresponding to the signal equipment fault documents were used as the training data and input into the classifier to obtain the SVM diagnostic model. For a new signal equipment fault document, the document was first preprocessed and then transformed into the topic feature space; the resulting topic features were input into the previously trained SVM diagnostic model to diagnose the new fault, so that the type of fault could be identified and the fault diagnosis of the railway signal equipment realized. The corresponding SVM diagnosis process is shown in Figure 12.
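
A minimal sketch of this training and diagnosis step is given below; it assumes the segment function, vectorizer, lda model, and fault_labels from the earlier sketches, and the new fault record is illustrative. Note that scikit-learn's SVC applies the one-vs-one scheme internally for multi-class problems.

```python
# Train the one-vs-one SVM on the topic features and diagnose a new fault record.
from sklearn.svm import SVC

svm_clf = SVC(kernel="rbf", decision_function_shape="ovo")
svm_clf.fit(doc_topic_matrix, fault_labels)

def diagnose(raw_text):
    """Preprocess a new fault record and return the predicted fault category."""
    tokens = segment(raw_text)                          # word segmentation (Section 3.1)
    counts = vectorizer.transform([" ".join(tokens)])   # term-document vector
    topic_vec = lda.transform(counts)                   # topic feature space (Section 3.2)
    return svm_clf.predict(topic_vec)[0]

print(diagnose("道岔转换时间超限，转辙机空转"))   # illustrative new record
```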

5. Experimental Analysis

5.1. Evaluation Index of the SVM Algorithm

In the design process of the classifier, an index is usually needed to evaluate the performance of the classifier, which is not only beneficial to the subsequent optimization of the classifier but can also intuitively show the quality of the classification effect.

Precision is usually used to evaluate the performance of a classifier. The precision rate represents the proportion of truly positive samples among the samples predicted by the model to be positive. However, when the distribution ratios of the different sample categories differ greatly, the precision rate cannot fully reflect the classification performance. This is because, when the data distribution is unbalanced, samples from minority categories are easily misclassified into majority categories, and since minority categories contain only a small number of samples, even poor classification of these categories has little impact on the overall precision of the classifier. To make up for this shortcoming, the F1-measure was introduced in this study as an additional evaluation indicator to assess the classifier more comprehensively.

The F-measure is calculated from Precision and Recall. The precision rate refers to the proportion of correctly predicted positive samples among all samples predicted to be positive, and it directly reflects how accurate the classified samples are:

Precision = TP / (TP + FP)   (10)

In the above equation, TP is the number of samples correctly categorized into this class, and FP is the number of samples wrongly assigned to this class.

The recall rate refers to the proportion of correctly predicted positive samples among the actual positive samples:

Recall = TP / (TP + FN)   (11)

In equation (11), FN represents the number of samples that belong to this category but have been misclassified into another category.

When the classification effect of each fault type is considered separately, there is often a trade-off between Precision and Recall. The F-measure is the weighted harmonic mean of Precision and Recall and can therefore balance the two. The expression for the F-measure is as follows:

F = ((β² + 1) × Precision × Recall) / (β² × Precision + Recall)   (12)

The F1-measure used in this study is obtained from equation (12) when β = 1:

F1 = (2 × Precision × Recall) / (Precision + Recall)   (13)
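
With scikit-learn, these indexes can be computed per fault category as in the following sketch, where y_te and y_pred stand for the true test labels and the SVM predictions; the macro averaging is an illustrative choice.

```python
# Compute Precision, Recall, and F1 for the diagnosis results.
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score

precision = precision_score(y_te, y_pred, average="macro")
recall = recall_score(y_te, y_pred, average="macro")
f1 = f1_score(y_te, y_pred, average="macro")           # per-class 2PR/(P+R), then averaged
print(classification_report(y_te, y_pred, digits=3))   # per-class Precision, Recall, F1
```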

5.2. Effect Analysis of the Fault Diagnosis Experiment

To verify the proposed feature extraction and fault diagnosis algorithm, the fault diagnosis program was written using the Python programming language, and the corresponding experimental test and verification were performed on a PC configured with an Intel Core i7-6700HQ 2.60 GHz CPU and 16 GB memory.

To ensure that it was consistent with the actual fault diagnosis of the railway signaling and communications department, the 1000 fault text data samples used in the experiment were all taken from fault data recorded by the Lan-Xin high-speed railway signaling and communications department from 2017–2019.

The Naive Bayes (NB) algorithm, logistic regression (LR) algorithm, random forest (RF) algorithm, K-nearest neighbor (KNN) algorithm, and support vector machine (SVM) algorithm were selected to be used in the experiment. The widely used word-space model and the LDA topic model were used to extract the features of the data set, and then the two processed data sets were trained and tested. Through the Precision, Recall, F1-measure, and other indicators, the diagnosis effect was evaluated, analyzed, and compared to verify the impact of the LDA topic model on the performance of the fault diagnosis. To prevent overfitting problems, this experiment randomly used 70% of the data set to train the classifier, and the remaining 30% was used as the test set.
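
The comparative experiment can be sketched as follows; the classifiers use default parameters, so the exact scores in Tables 3 and 4 are not reproduced, and doc_topic_matrix and fault_labels come from the earlier sketches.

```python
# Compare NB, LR, RF, KNN, and SVM on the same 70/30 split of the topic features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X_tr, X_te, y_tr, y_te = train_test_split(doc_topic_matrix, fault_labels,
                                          test_size=0.3, random_state=0)
models = {
    "NB": MultinomialNB(),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(kernel="rbf"),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(name, round(f1_score(y_te, pred, average="macro"), 3))
```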

5.2.1. Experiment on Spatial Classification of Word Features

The fault text data was expressed in the traditional vector space model. Then the processed data was input into the classifier for training. The classification effect is shown in Table 3.

It can be seen from Table 3 that under the traditional word feature space method, the KNN algorithm had the best classification effect, while the RF algorithm in the ensemble classifier was the least effective.

5.2.2. Experiment of Topic Feature Space Classification

After using the LDA topic model to extract fault features from the same original fault data, the classifier was trained and tested again. The classification results are shown in Table 4.

As Table 4 shows, after the LDA topic model extracted the fault record text features and transformed them into the topic feature space, the classification indexes of most classifiers were improved to a certain extent.

5.3. Experimental Analysis of Fault Diagnosis Algorithm

Through a comparative analysis of the above experiments, it can be seen that the method of fault diagnosis of railway signal equipment based on the LDA and SVM models is better than the established method of combining word feature space with various classifiers. It shows that the method based on machine learning and natural language processing proposed in this paper has certain advantages in the fault diagnosis of railway signal equipment and can provide a certain reference value for railway signal equipment fault diagnosis.

6. Conclusions

In this paper, based on the characteristics of the fault record text data of railway signal equipment from the Lan-Xin high-speed railway, a fault feature vocabulary for the railway signal equipment field was constructed. The LDA topic model was used to extract the fault features, and the fault record text was transformed into the topic feature space and compared with the commonly used word feature space. An SVM diagnostic model was then constructed on this basis to diagnose the faults and was compared with the NB, LR, RF, and KNN algorithms. The accuracy and effectiveness of the proposed model were verified by experiments on field data. The results show that the model can help field personnel quickly diagnose and repair railway signal equipment failures and provides useful guidance for maintenance.

In future research, we will try to process the text records of a single type of signal equipment failure and test the influence of different classification algorithms on the corresponding evaluation indicators in order to further diagnose specific types of signal equipment failure. The method is also suitable for the large amounts of textual fault information recorded in many other industrial production activities and therefore has broad application prospects.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Natural Science Foundation of China (51967010) and the Key Program of Natural Science Foundation of Gansu Province, China (20JR5RA428) (20JR10RA218).