Abstract

This study explored the establishment and application of an auxiliary scoring model, based on random forest (RF), for patients with acute pulmonary embolism (APE) complicated with atrial fibrillation (AF). A retrospective analysis was performed on the general data, underlying diseases, laboratory indicators, and cardiac indicators of 100 patients with APE admitted to our hospital from 2018 to 2021. The occurrence of AF in patients with APE was taken as the categorical outcome variable, and the general data, underlying diseases, laboratory indicators, and cardiac indicators were taken as input variables. Auxiliary risk scoring models for patients with APE complicated with AF were then established based on RF and logistic regression. Finally, accuracy, sensitivity, specificity, recall, precision, F1 value, and the receiver operating characteristic (ROC) curve were used to evaluate the predictive value of the models. In the RF model, the optimal number of variables tried at each node was 3 and the optimal number of decision trees was 500. The predictors, in descending order of importance, were homocysteine (Hcy), diabetes mellitus, free triiodothyronine (FT3) level, uric acid (UA) level, left atrial diameter, hypertension, and smoking history. For the RF model, accuracy was 0.934, sensitivity 0.966, specificity 0.876, recall 0.966, precision 0.934, and F1 value 0.950. For the logistic regression model, accuracy was 0.816, sensitivity 0.915, specificity 0.125, recall 0.902, precision 0.811, and F1 value 0.896. The areas under the ROC curve (AUC) of the RF and logistic regression models were 0.984 and 0.883, respectively. We conclude that the RF model was better than the logistic regression model in predicting AF in APE patients and therefore has clinical application value.

1. Introduction

Acute pulmonary embolism (APE) is caused by thrombi, originating in the venous system or the right heart, that obstruct the pulmonary artery and its branches [1]. APE not only disturbs the pulmonary circulation but also increases right ventricular load and dilates the right heart, which can result in atrial fibrillation (AF) [2]. Relevant studies have pointed out that the abnormal atrial electrical activity induced by AF can affect cardiac systolic and diastolic function, impair the pumping function of the heart, and cause irregular ventricular contraction and electrical response, thus inducing heart failure [3]. Other studies have shown that AF is closely related to the prognosis of APE patients: AF can promote the further formation and shedding of emboli, thus inducing cardiovascular and cerebrovascular diseases such as cerebral infarction and myocardial infarction and greatly increasing the risk of poor prognosis [4]. Early warning and early intervention are the keys to reducing AF. However, prediction methods for APE with concurrent AF are lacking. In multivariate logistic regression analyses, age, height, obesity, hypertension, left ventricular ejection fraction (LVEF), and a history of AF have been identified as factors influencing the occurrence of AF [5, 6]. However, logistic regression models fit poorly and their accuracy is not high. In addition, the conclusions of individual studies differ greatly.

Random forest (RF), a classical machine learning algorithm proposed by Leo Breiman in 2001, is a classifier that trains multiple decision trees and takes as its output the mode of the classes output by the individual trees. RF can be used for both classification and regression with high accuracy, can deal with a large number of input variables, can balance errors, and can evaluate the importance of variables in determining categories. This method has been successfully applied to the early prediction and prognosis assessment of various diseases [7]. Therefore, we speculated that RF may also help clinicians predict the risk of AF in patients with APE. This study retrospectively analyzed the general data, underlying diseases, laboratory indicators, and cardiac indicators of APE patients and established auxiliary scoring models for APE patients complicated with AF based on RF and logistic regression, respectively. The application effects of the two models were compared and analyzed, with the aim of providing new ideas for the early detection and treatment of AF in patients with APE.
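The voting rule described above, in which the forest outputs the mode of the classes predicted by its trees, can be sketched in a few lines of Python. This is only an illustration: the function name and the five example votes are hypothetical and do not come from the study data.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most frequent class label among the individual tree outputs."""
    counts = Counter(tree_predictions)
    # most_common(1) returns a one-element list [(label, count)];
    # ties are resolved in favour of the label seen first.
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical outputs of five trees for one patient (1 = AF, 0 = no AF)
votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(votes)
```

With three of the five hypothetical trees voting for AF, the forest as a whole predicts AF for this patient.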

2. Materials and Methods

2.1. Data Sources

A retrospective analysis was performed on the general data, underlying diseases, laboratory indicators, and cardiac indicators of patients with APE admitted to our hospital from 2018 to 2021. Inclusion criteria were as follows: (1) APE clearly diagnosed by pulmonary angiography according to the criteria in the Chinese expert consensus on diagnosis and treatment of acute pulmonary embolism (2015) [8]; (2) time from onset to admission of less than 2 weeks; (3) receipt of routine treatment, such as symptomatic anti-infective therapy and improvement of cardiac function; and (4) normal coagulation function and immune system. Exclusion criteria were as follows: (1) serious organ diseases such as congenital heart disease, hepatic insufficiency, and kidney failure; (2) malignant tumors; (3) infectious pneumonia or other infectious diseases; (4) serious cardiovascular and cerebrovascular diseases such as myocardial infarction and cerebral infarction; and (5) in-hospital death. One hundred patients in total were enrolled, including 57 males and 43 females. The average age was (53.55 ± 3.35) years. The time from onset to admission was 3–12 days, with an average of (7.86 ± 1.47) days. This study was approved by the Ethics Committee of Taizhou Hospital of Zhejiang Province, affiliated with Wenzhou Medical University.

2.2. Data Collection

A uniformly designed case data sheet was used to collect clinical data from patients. (1) The general data included age, sex, height, time from onset to admission, body mass index (BMI), obesity, pregnancy, risk degree of disease, AF history, APE, drinking history, and smoking history. (2) The underlying diseases included diabetes, hypertension, and hyperlipidemia. (3) The laboratory indicators included D-dimer, albumin, homocysteine (Hcy), uric acid (UA), creatinine, serum free triiodothyronine (FT3), and serum free tetraiodothyronine (FT4). (4) The cardiac indicators included ventricular rate, left ventricular ejection fraction (LVEF), left atrial internal diameter, and right atrial internal diameter.

2.3. Relevant Variables and Definitions

(1) BMI = weight (kg)/height² (m²): 18.5–23.9 kg/m² was defined as normal, 24.0–27.9 kg/m² as overweight, and ≥28.0 kg/m² as obese [9]. (2) Smoking history: an average of ≥1 cigarette per day for more than 1 year. (3) Drinking history: drinking at least once a month for more than 6 months. (4) Hypertension: systolic blood pressure ≥140 mmHg (1 mmHg = 0.133 kPa) and/or diastolic blood pressure ≥90 mmHg [10]. (5) Diabetes: fasting blood glucose ≥7.0 mmol/L and/or 2 h postprandial blood glucose ≥11.1 mmol/L [11]. (6) Hyperlipidemia: low-density lipoprotein concentration ≥3.37 mmol/L [12]. (7) Detection of D-dimer, albumin, Hcy, and UA: before treatment, 5 mL of fasting cubital venous blood was collected from each patient and divided between two test tubes. One tube was centrifuged at 3000 r/min for 10 min in a low-speed centrifuge, and after separation the plasma D-dimer level was determined by immunoturbidimetry. The other tube was centrifuged at 4000 r/min for 10 min; the serum albumin level was determined by the colorimetric method, the Hcy level by HPLC, and the UA level by the uric acid oxidase method. The test kits were purchased from Shanghai Canspec Scientific Instruments Co., Ltd., and the test procedures were carried out in strict accordance with the kit instructions.
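As a worked illustration of definition (1), the BMI formula and the cut-offs above can be written as two small Python functions. This is a sketch under stated assumptions: the function names are hypothetical, the obesity cut-off is read as 28.0 kg/m² and above, and the label for values below 18.5 kg/m² is our own, since the text does not name that range.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Map a BMI value (kg/m^2) to the categories used in the definitions above."""
    if value >= 28.0:
        return "obese"
    if value >= 24.0:
        return "overweight"
    if value >= 18.5:
        return "normal"
    # Assumption: the source does not name the range below 18.5 kg/m^2.
    return "below normal range"


# Hypothetical patient: 70 kg, 1.75 m
example_bmi = bmi(70, 1.75)
example_category = bmi_category(example_bmi)
```

Using lower bounds only, rather than the closed ranges 18.5–23.9 and 24.0–27.9, avoids leaving values such as 23.95 kg/m² unclassified.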

2.4. AF Evaluation Criteria and Grouping

AF diagnostic criteria [13]: (1) Symptoms: common symptoms include palpitations, fatigue, chest tightness, chest pain, and decreased exercise tolerance; some patients have no clinical symptoms. (2) Signs: on auscultation, the first heart sound varies in intensity, the rhythm is absolutely irregular, and a pulse deficit is present when the ventricular rate is fast. (3) ECG manifestations: the P wave disappears and is replaced by atrial fibrillation waves (f-waves) of inconsistent size, shape, and amplitude, at a frequency of 350–600 per minute; the QRS complexes are absolutely irregularly spaced and usually vary in shape and amplitude. Patients complicated by AF were assigned to the AF group, and patients without AF were assigned to the non-AF group.

2.5. Statistical Methods

Statistical analysis was performed using R 4.1.2 and SPSS 23.0. Measurement data conforming to a normal distribution were expressed as mean ± standard deviation and compared using the independent-samples t-test. Count data were expressed as frequencies and percentages and analyzed using the chi-square test or Fisher’s exact test. The dataset was randomly divided into a training set and a validation set at a ratio of 0.67 : 0.33. The training set was used to build the RF model and the multivariate logistic regression model, and the validation set was used to test the predictive performance of the models. Differences with P values below the significance threshold were considered statistically significant.

(1) Establishment of the multivariate logistic regression model: univariate analysis was used to screen variables, and variables that reached statistical significance were included. The maximum likelihood method was used to construct a prediction model of APE patients complicated with AF based on multivariate logistic regression analysis.

(2) Establishment of the RF model: the randomForest package was used to implement the RF algorithm with Bootstrap random sampling, which by default samples with replacement. The RF model has two important parameters: the number of decision trees, ntree, and the number of candidate variables preselected at each split node, mtry. The RF warning model was built on the training set, and the optimal mtry and ntree were selected to reduce the prediction error rate of the model. The importance function was used to calculate the importance of each model variable, with a larger value indicating a more important variable.

(3) Model evaluation: accuracy, sensitivity, specificity, recall, precision, F1 value, and the area under the curve (AUC) were used to evaluate the performance of the risk warning model for APE patients complicated with AF, and receiver operating characteristic (ROC) curves were drawn to compare AUC values. The larger the AUC (that is, the closer to 1), the better the model’s predictive performance.
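The threshold-based evaluation metrics can all be derived from the four cells of a confusion matrix. The Python sketch below uses the standard textbook definitions, under which recall is the same quantity as sensitivity; whether the study computed each metric exactly this way is not stated, so this is an assumption. The function name and the example counts are hypothetical and are not the study's validation results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fn: AF patients predicted as AF / as non-AF
    tn/fp: non-AF patients predicted as non-AF / as AF
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)      # identical to recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)        # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "recall": sensitivity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical validation-set counts, for illustration only
example = classification_metrics(tp=8, fp=2, tn=20, fn=3)
```

Because the F1 value is the harmonic mean of precision and recall, any one of the three can be checked against the other two when reading a results table.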

3. Results

3.1. Basic Information of the Patients and Univariate Analysis of Risk Factors for AF in Patients with APE

(1) Among the 100 patients in this study, 24 were complicated by AF and 76 were not, an incidence of 24%. (2) Univariate analysis showed statistically significant differences in smoking history, diabetes, hypertension, Hcy, UA, FT3, and left atrial diameter between the AF group and the non-AF group, as shown in Table 1.

3.2. Multivariate Logistic Regression Analysis of APE Patients Complicated with AF

Variables that were statistically significant in the univariate analysis were included as independent variables in the multivariate logistic regression, and the variable assignment is shown in Table 2. Multivariate logistic regression analysis was performed with AF (yes = 1, no = 0) as the dependent variable. The results showed that smoking history, diabetes, hypertension, high Hcy, high UA, FT3, and left atrial diameter were independent risk factors for AF in APE patients. The results of the multivariate logistic regression analysis are shown in Table 3.

3.3. Analysis Results of the RF Prediction Model

The variables identified by the multivariate logistic regression analysis of APE patients complicated with AF were used to establish the RF model. The relationship between model error and the number of decision trees was analyzed to define the number of trees (ntree) and determine the optimal number of candidate variables per split (mtry). When mtry = 3 and ntree = 500, the error had stabilized and the optimal model could be established, as shown in Figure 1.
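The text does not say how the model error was estimated for each number of trees. A common choice for random forests is the out-of-bag error, which relies on the fact that each tree's bootstrap sample leaves some training rows unused; that is the assumption behind the Python sketch below. The function name is hypothetical, and the 67 training rows are only an approximation of a 0.67 share of the 100 enrolled patients.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows indices with replacement; return (in_bag, out_of_bag) indices."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)      # fixed seed so the sketch is repeatable
n_training_rows = 67        # assumption: about 0.67 of 100 patients
ntree = 500                 # number of trees reported as optimal

oob_fractions = []
for _ in range(ntree):
    in_bag, out_of_bag = bootstrap_sample(n_training_rows, rng)
    oob_fractions.append(len(out_of_bag) / n_training_rows)

# On average a little over one third of the rows are left out of each tree
mean_oob_fraction = sum(oob_fractions) / ntree
```

Rows left out of a tree can be scored by that tree as if they were unseen data, which is what allows an error curve to be tracked as trees are added without touching the validation set.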

3.4. Importance Ranking of RF Model Variables

This study used the mean decrease in the Gini value as the measure of variable importance in the RF model to further clarify the important predictors of AF in APE patients. The Hcy score was the highest, indicating that Hcy played the major role in the model. The next most important predictors were, in sequence, diabetes mellitus, FT3 level, UA level, left atrial diameter, hypertension, and smoking history; these predictors also contributed to the model classification. The analysis results are shown in Table 4 and Figure 2.
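To make the Gini-based importance measure concrete, the Python sketch below computes the decrease in Gini impurity produced by a single split. In a forest, these decreases are accumulated over every split made on a given variable and averaged over the trees. The function names and the eight example labels are hypothetical and do not come from the study data.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (1 = AF, 0 = no AF)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_pos = sum(labels) / n
    return 1.0 - p_pos ** 2 - (1.0 - p_pos) ** 2


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Hypothetical node with 3 AF and 5 non-AF patients, split on one predictor
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left = [1, 1, 1, 0]     # e.g. patients above a cut-off of the predictor
right = [0, 0, 0, 0]    # e.g. patients at or below the cut-off
decrease = gini_decrease(parent, left, right)
```

A split that separates AF from non-AF patients more cleanly yields a larger decrease, so a variable that is chosen often and splits well accumulates a higher importance score.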

3.5. Comparison of Two Prediction Models

Validation set data were used to compare the performance of the RF and logistic regression models in predicting AF in APE patients. The accuracy, sensitivity, specificity, recall, precision, and F1 value of the RF prediction model were all higher than those of the logistic regression model, and the analysis results are shown in Table 5. The AUC of the RF prediction model was 0.984, and its ROC curve is shown in Figure 3. The AUC of the logistic regression prediction model was 0.883, and its ROC curve is shown in Figure 4. The AUC of the RF prediction model was larger than that of the logistic regression model.
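The AUC values compared above have a direct interpretation: the AUC equals the probability that a randomly chosen AF patient receives a higher predicted score than a randomly chosen non-AF patient, with ties counted as one half. The Python sketch below computes it that way. The function name, scores, and labels are hypothetical and are not the study's validation data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0     # positive ranked above negative
            elif p == q:
                wins += 0.5     # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted AF probabilities for six validation patients
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
example_auc = auc(scores, labels)
```

Unlike accuracy or specificity, this quantity does not depend on the probability cut-off used to assign patients to the AF class, which is one reason it is convenient for comparing two models.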

4. Discussion

APE is a common clinical emergency with a complex etiology. Previous studies have related the occurrence of APE to factors such as long-term bed rest and chronic lung disease [14]. APE readily causes AF, which threatens patients’ lives, so the hidden danger of APE complicated with AF cannot be ignored [15]. Relevant investigations showed that the incidence of AF in APE patients was as high as 15%–21%, which was significantly higher than that of healthy peers [16]. In this study, 24 of 100 APE patients were complicated by AF, an incidence of 24%, which was similar to previous reports and suggests that APE patients remain at high risk of AF and that clinical vigilance should be raised. Early identification of the factors influencing AF in APE patients, together with timely targeted intervention, is of great significance for reducing the probability of AF in these patients. It would also help to reduce the economic burden on patients, improve clinical treatment benefits, and reduce mortality [17]. In the past, logistic regression models were used to analyze the factors influencing AF in APE patients. However, because of the limited goodness of fit and accuracy of logistic regression models and the quite different conclusions of various studies, it was necessary to find a prediction model with higher accuracy to further identify the risk factors of AF.

RF is one of the most important machine learning algorithms at present. It has been used to build prediction models for different diseases with good results. In this study, auxiliary scoring models for APE patients with AF were established based on RF and multivariate logistic regression, respectively. The results showed that the RF model could effectively distinguish individuals with AF from those without AF. Its accuracy, sensitivity, specificity, recall, precision, and F1 value were all higher than those of the logistic regression model, and the AUCs of the two models were 0.984 and 0.883, respectively. In the RF model, the predictors in descending order of importance were Hcy, diabetes mellitus, FT3 level, UA level, left atrial diameter, hypertension, and smoking history. The effect size analysis of the multivariate logistic regression prediction model likewise indicated that smoking history, diabetes, hypertension, Hcy, UA, left atrial diameter, and FT3 were related to AF in APE patients.

(1) Imtiaz and colleagues followed up 11047 patients and found that the detection rate of AF was 9.5% in smokers and 7.8% in nonsmokers [18]. Their study also found that smoking was strongly associated with a 15% increase in the risk of AF over 10 years. Smoking increases susceptibility to AF through indirect and direct mechanisms. Smokers are more likely to develop abdominal obesity, atherosclerosis, hypertension, diabetes, chronic obstructive pulmonary disease, coronary heart disease, and heart failure [19]. These conditions are risk factors for AF and can indirectly promote its occurrence. Nicotine, the main ingredient in tobacco, stimulates the body to release hormones such as catecholamines, which activate the sympathetic nervous system and raise blood pressure and heart rate. It has been confirmed that nicotine directly promotes the occurrence of AF [20].

(2) The toxicity of long-term hyperglycemia damages myocardial cells, resulting in abnormal cardiac structure [21]. In addition, insulin resistance and chronic inflammation in diabetic patients affect cardiac function, which may increase the risk of AF in patients with APE. Therefore, we suggest that patients with APE complicated with diabetes be given hypoglycemic drugs or insulin according to their actual situation to regulate their blood glucose level and prevent the occurrence of AF [22].

(3) Long-term hypertension can cause abnormal changes in left atrial hemodynamics, increasing the thickness and stiffness of the left ventricular wall and impairing ventricular diastolic function; the resulting rise in ventricular pressure and ventricular remodeling can increase the risk of AF in patients with APE [23]. Therefore, APE patients with hypertension should be given antihypertensive drugs to actively control blood pressure and prevent the occurrence of AF.

(4) Hcy is an intermediate product of the metabolism of methionine and cysteine [24]. Under normal circumstances, Hcy is metabolized by the body, so its serum concentration is low. However, the occurrence of APE affects the metabolism of Hcy, leading to an increase in serum Hcy levels. Relevant reports indicate that an increase in Hcy level disturbs the balance between the antioxidant and pro-oxidant systems, promotes the generation of reactive oxygen species, and triggers an oxidative stress response. The resulting large amount of peroxide leads to cellular calcium overload and hence to ventricular remodeling [25]. In addition, an increased Hcy level can promote the generation of various inflammatory factors, damage myocardial cells, and promote the occurrence and development of AF. Therefore, clinicians should be vigilant in patients with elevated Hcy levels and actively reduce those levels. Folic acid combined with vitamin B6 or vitamin B12 may reduce the risk of AF.

(5) UA is a metabolite of purine, is mainly excreted by the kidney, and has an antioxidant effect [26]. It has been reported that increased UA indicates increased activity of xanthine oxidase, which can activate the protein kinase pathway, promote the generation of inflammatory cells, cause atrial remodeling, and promote S-nitrosylation of ion channels, eventually causing AF.

(6) Thyroid hormone has a significant impact on the cardiovascular system [27]. Hyperthyroidism is associated with an increased risk of arrhythmia, and hypothyroidism may lead to atherosclerosis. At present, there are few reports on the relationship between FT3 and AF, which requires further study.

(7) Left atrial diameter is a major indicator of changes in cardiac structure and can be used to predict the risk of cardiovascular disease. APE can increase ventricular load, decrease ventricular systolic function, and cause acute left atrial dilation, finally resulting in an increase in left atrial diameter [28]. Enlargement of the left atrial diameter leads to increased atrial pressure, which in turn leads to atrial remodeling and increases the risk of AF.

5. Strengths and Limitations

This study has certain strengths. On the one hand, RF can be trained on all available variables, and hundreds of variables can be used directly in model construction without manual selection in advance. At the same time, the factors influencing concurrent AF in APE patients can be screened out algorithmically. On the other hand, the data needed to establish the prediction model can be obtained directly in the clinic, which helps clinical staff to assess the risk of AF in APE patients quickly and thus to provide early, personalized intervention for high-risk and critical patients. However, this study also has shortcomings. Some potential factors that may be related to the occurrence of AF were not included in the analysis, which would affect the accuracy of the prediction model to a certain extent. In addition, this was a single-center study that lacked the support of external data, so it remains uncertain whether the RF model would achieve equally good prediction results when applied more widely. Future studies should therefore incorporate external multicenter data to enhance the reliability of these results.

6. Conclusions

In conclusion, the RF model was better than the logistic regression model in predicting AF in APE patients. However, considering that the logistic regression model can explain its results directly, we suggest that clinicians combine the two models in practice to describe the factors affecting AF in APE patients more systematically, so as to provide new guidance for APE treatment and patient rehabilitation.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Disclosure

Jinhong Wu and Daochao Huang are the co-authors.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Jinhong Wu and Daochao Huang contributed equally to this work.