Abstract

Accurate reporting and prediction of PM2.5 concentration are very important for improving public health. In this article, we use a spectral clustering algorithm to cluster 44 cities in the Bohai Rim Region. On this basis, we propose a special difference equation model that uses nonlinear diffusion equations to characterize the temporal and spatial dynamics of PM2.5 propagation between and within clusters for real-time prediction. Through the analysis of PM2.5 concentration data for 92 consecutive days in the Bohai Rim Region, the average prediction accuracy of the difference equation model over all city clusters is 97% or 90%, depending on the accuracy definition used. The mean absolute error (MAE) of the forecast data for each urban agglomeration is within 7 units. The experimental results show that the difference equation model can effectively reduce the prediction time, improve the prediction accuracy, and provide decision support for local air pollution early warning and comprehensive urban management.

1. Introduction

PM2.5 refers to particulate matter in the atmosphere with a diameter less than or equal to 2.5 μm, also known as fine particulate matter. Although the content of PM2.5 in the atmosphere is low, it has a significant impact on air quality and visibility. Studies have shown that PM2.5 is the main cause of a variety of respiratory diseases [1]. Therefore, accurate prediction of PM2.5 not only helps to monitor the effect of existing governance measures but also provides direction for future air governance and can tell people the best times to travel.

There are many studies on PM2.5 prediction [2, 3]. Each method approaches the problem from a different perspective. Among them, statistical methods and satellite remote sensing techniques are the most widely used. The statistical method is an empirical prediction method. Common statistical methods include linear regression models [4, 5], neural networks [3], and nonlinear regression models [6, 7]. Although statistical methods are convenient and simple to operate, they require a large amount of data to be collected in advance, and data processing is slow. Satellite remote sensing techniques [2] offer wide coverage and long observation periods, but the equipment cost is high, and they are not suitable for long-term prediction in a small area.

With the development of applied mathematics, the use of equation models to study the propagation laws and development trends of atmospheric pollutants such as PM2.5 has become an extremely important subject in biomathematics research. There have been many mature studies in recent years [8–10]. For example, Wang et al. [11] first established a partial differential equation model based on space-time dimensions to predict PM2.5, and then in 2020 they predicted PM2.5 with a data-driven ordinary differential equation model [12]. However, these models are all differential equation models, and one of the most important assumptions behind differential equation models is the continuity of time. In reality, the collected data are all discrete, so a differential equation model introduces certain errors. The difference equation is the discretization of the differential equation, so the difference equation model is also a powerful tool for predicting data. This article uses a difference equation model to predict and analyze PM2.5.

This work aims to explore large-scale (between urban areas) air pollution migration and make further predictions. Specifically, we build a difference equation model based on the network and clustering of 44 cities in the urban agglomeration of the Bohai Rim Region, combining local emissions and global diffusion, to describe the temporal and spatial dynamic propagation process of PM2.5 in the region. This model requires no large-scale computation. At the same time, the simulation results show that the model not only has good predictive ability but can also provide policy insights to a certain extent, offering a more scientific theoretical basis for controlling air pollution in the Bohai Rim Region.

This article mainly uses multisource data to make short-term forecasts of PM2.5. The study uses data on geographic distance between cities, wind direction, wind speed, and PM2.5 concentration. Figure 1 shows the framework of this research. The rest of the paper is organized as follows: Section 2 describes the research area, clusters it according to geographical distance, wind direction, wind speed, and other conditions, and constructs the difference equation model of spatiotemporal propagation; Section 3 presents the PM2.5 prediction results and an error analysis of those results; Section 4 gives a summary discussion and some thoughts on future work.

2. Materials and Methods

2.1. Study Area

The Bohai Rim Region refers to the Bohai Rim coastal economic belt dominated by the Liaodong Peninsula, Shandong Peninsula, and Beijing-Tianjin-Hebei. This area accounts for about 13.31% of the country’s land area and 22.2% of the total population. At present, the concentration of PM2.5 is extremely high in densely populated urban agglomerations such as Beijing, Tianjin, and Hebei. Therefore, urban agglomerations in the Bohai Rim Region are facing serious PM2.5 pollution problems. Figure 2 shows the study area. The red markers on the map represent the 216 major air monitoring stations in the area, and the black markers represent all prefecture-level cities in the area.

2.2. Fine Particulate Matter (PM2.5) Data

The city clusters in the Bohai Rim Region in this study include 44 prefecture-level cities in the 5 provincial-level regions of Beijing, Tianjin, Liaoning, Hebei, and Shandong. The research data cover 92 days of PM2.5 concentration data from July 1, 2020, to September 30, 2020. The average daily PM2.5 level of each cluster was calculated from the daily PM2.5 levels of all cities in the cluster. Specifically, we calculate the average PM2.5 concentration of each prefecture-level city from the data collected at the 216 monitoring stations and then calculate the average PM2.5 concentration of each city cluster from the concentrations of its prefecture-level cities. All research data come from the National Urban Air Quality Real-Time Release Platform of China Environmental Monitoring Station. The original PM2.5 concentrations of each city are normalized to a discrete level value 1, 2, …, 6 according to the Ambient Air Quality Standards (GB3095-1996) of China: concentrations are divided into the ranges 0–35, 36–75, 76–115, 116–150, 151–250, and greater than 250 μg/m³, and these ranges are assigned levels 1 to 6, corresponding to air quality that is good, mild, moderate, severe, highly severe, and seriously severe.
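The preprocessing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sample readings are made up, and values that fall between two listed ranges (for example 35.5) are assumed to belong to the higher level.

```python
from bisect import bisect_left
from statistics import mean

# Upper bounds of levels 1-5; anything above 250 is level 6.
UPPER_BOUNDS = [35, 75, 115, 150, 250]


def concentration_to_level(concentration):
    """Map a PM2.5 concentration to a discrete level 1..6 (hypothetical helper)."""
    return bisect_left(UPPER_BOUNDS, concentration) + 1


def cluster_daily_mean(stations_by_city):
    """Average station readings per city, then average the city means.

    stations_by_city maps a city name to the list of its station readings
    for one day (illustrative data layout, not the authors' format).
    """
    city_means = [mean(readings) for readings in stations_by_city.values()]
    return mean(city_means)


# Made-up readings for two cities in one cluster on one day.
example_day = {"city_a": [10, 20], "city_b": [30]}
cluster_value = cluster_daily_mean(example_day)
cluster_level = concentration_to_level(cluster_value)
```

Averaging per city first, as the text describes, keeps a city with many stations from dominating the cluster mean.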

2.3. Clustering and Embedding

In the study of regional transport of PM2.5, we divided the 44 cities in the Bohai Rim Region into four city clusters so that we could conveniently put forward a specific difference equation model to describe the transmission process of PM2.5 within and between clusters. The motif in Figure 3 reflects the movement of PM2.5 from the source to the target in the city network. We use this motif as the basic module of the complex network and use the high-order spectral clustering algorithm in [13] to divide the urban agglomeration in the Bohai Rim Region into four clusters, as shown in Figure 4. For related work on high-order spectral clustering, see [14].

As mentioned above, we cluster the 44 cities in the Bohai Rim Region into 4 disjoint sets through the high-order spectral clustering algorithm, and we order these sets in a meaningful way [11]. For a general clustering partition, the spatial arrangement of the sets can be based on specific modeling goals and on social or geographical characteristics of the underlying network. In [15, 16], the level of democracy, diaspora size, international economic relations, and geographical proximity are used to order the sets. In [17], friendship hops are used to define a distance metric, and each set is then embedded at a location on an axis that represents social distance. For PM2.5, however, meteorological conditions are the most important factor affecting concentration. Therefore, in this study, we sorted the sets according to wind direction. From July to September, the prevailing wind direction in the Bohai Rim is southerly, so the four city clusters are projected from south to north onto the x-axis of the Cartesian coordinate system, and their locations along this axis are labeled in that order, as shown in Figure 4.
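As a small illustration of this ordering step, the sketch below sorts clusters from south to north and assigns them consecutive positions on the axis. The cluster names, the latitudes, and the choice of integer positions starting at 1 are all hypothetical; the text only states that the clusters are projected from south to north.

```python
def order_south_to_north(cluster_latitudes):
    """Return {cluster_name: position}, with position 1 for the southernmost cluster.

    cluster_latitudes maps a cluster name to a representative latitude
    (for example the mean latitude of its cities); values here are invented.
    """
    ranked = sorted(cluster_latitudes, key=cluster_latitudes.get)
    return {name: position for position, name in enumerate(ranked, start=1)}


# Hypothetical representative latitudes for four clusters.
positions = order_south_to_north({"A": 40.1, "B": 36.5, "C": 38.9, "D": 41.8})
```

With a southerly prevailing wind, a lower position then corresponds to a cluster that is upwind of the clusters with higher positions.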

2.4. Model

As shown in Figure 5, the pollution sources in each city cluster have a strong impact on that cluster (local emission), and different city clusters also influence each other through factors such as air flow (global transport). Therefore, this paper proposes a difference equation model with time and space factors to describe the dynamic propagation process of PM2.5. Within a city cluster, factories, cars, and other sources generate a large amount of PM2.5. The generation and dissipation of PM2.5 in the cluster can be regarded as local emission. When PM2.5 in a city cluster spreads to one or more other clusters along with airflow and other factors, it can be regarded as global transport.

In the following, we propose a nonlinear difference equation-based model that abstracts PM2.5 transport into two processes: local emission and global transport (Figure 5). Local emission reflects the diffusion within a cluster and the underlying network structure and is directly related to the cluster. Global transport is the spread of PM2.5 between clusters due to airflow and other factors, usually manifested as a more or less random walk. This approach will extend our analysis of difference equation modeling results.
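To make the two processes concrete, the following minimal sketch advances a row of clusters by one time step. It is an illustration under stated assumptions rather than the model of this paper: global transport is taken to be a discrete diffusion term between neighbouring clusters, local emission is taken to be ordinary logistic growth, the ends of the row are treated as zero-flux boundaries, and all names (`step`, `transport`, `growth_rate`, `capacity`) are hypothetical.

```python
def step(levels, transport, growth_rate, capacity):
    """Advance cluster concentrations by one time step (illustrative only).

    levels      -- current value for each cluster, ordered along the axis
    transport   -- per-cluster transport coefficient (piecewise constant)
    growth_rate -- growth rate of the local process at this time step
    capacity    -- carrying capacity of the local process
    """
    n = len(levels)
    updated = []
    for i, u in enumerate(levels):
        # Zero-flux boundaries: mirror the end values so nothing flows in or out.
        left = levels[i - 1] if i > 0 else u
        right = levels[i + 1] if i < n - 1 else u
        global_transport = transport[i] * (left - 2 * u + right)
        local_emission = growth_rate * u * (1 - u / capacity)
        updated.append(u + global_transport + local_emission)
    return updated


# With the local process switched off, transport only redistributes the total.
after_transport = step([1.0, 2.0, 4.0, 3.0], [0.1] * 4, growth_rate=0.0, capacity=6.0)
```

Separating the two terms in this way makes it easy to see which part of a day-to-day change is attributed to exchange between clusters and which to processes inside a cluster.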

The difference equation model consists of the following components:
(i) The state variable is the PM2.5 concentration in a given area (cluster location) at a given time.
(ii) The time evolution is expressed through the first-order forward difference of the concentration with respect to time, i.e., the concentration at the next time step minus the concentration at the current time step.
(iii) The transport term represents the regional transport (global transport) of PM2.5 between different clusters. Its coefficient describes the transport ability of the cluster at a given location. Different city clusters have different transport capabilities, so a piecewise function is used to represent this coefficient; the value of each segment needs to be determined according to the actual situation.
(iv) The growth term represents the spread process (local process) within a cluster. It contains two parameters that are real numbers greater than 0. This mathematical expression has been used to describe and predict the dynamics of various populations, such as the growth of bacteria and tumors [18].
(1) The growth rate of the local process varies with time. It depicts PM2.5 dissipation under external changing factors such as wind or certain other atmospheric conditions [19, 20]. The optimal values of its parameters are determined from the actual data we collected.
(2) The carrying capacity of the system is the maximum possible volume of PM2.5 at a given location.
(v) The initial function specifies the PM2.5 concentration at the initial time.
(vi) The Neumann boundary condition means that no PM2.5 flows in or out at the two boundary positions, i.e., at these positions the concentration inside the cluster and the outside concentration reach an equilibrium state.
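For concreteness, one form that is consistent with this description can be written as follows. Every symbol here ($u$ for concentration, $d$ for transport ability, $r$ for growth rate, $K$ for carrying capacity, $\alpha$ and $\beta$ for the two positive parameters, $\varphi$ for the initial function, $l$ and $L$ for the boundary positions), the placement of the exponents, and the choice of $t = 1$ as the initial time are assumptions made for illustration and should not be read as the notation of this paper:

```latex
\begin{aligned}
\Delta_t u(x,t) &= \Delta_x\bigl(d(x)\,\Delta_x u(x,t)\bigr)
   + r(t)\,u(x,t)^{\alpha}\Bigl(1-\frac{u(x,t)}{K}\Bigr)^{\beta}, \\
\Delta_t u(x,t) &= u(x,t+1)-u(x,t), \\
u(x,1) &= \varphi(x), \\
\Delta_x u(l,t) &= \Delta_x u(L,t) = 0 .
\end{aligned}
```

With $\alpha = \beta = 1$ the growth term reduces to the ordinary logistic form, which is the simplest member of this family.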

2.5. Accuracy Definition

By comparing the predicted value calculated by the model with the actual observation value, the difference equation model can be continuously optimized. Two accuracy measures compare the actual value of the PM2.5 concentration level with the predicted value of the PM2.5 concentration level:
(i) MAIA (Mean Absolute Increment Accuracy), which is proposed in this paper based on practical significance, is the mean, over the sample points in the test data set, of the absolute increment accuracy (AIA), which evaluates the absolute accuracy at each sample point. There are six concentration levels in total, from level one to level six, and AIA describes the absolute accuracy in terms of level length [11].
(ii) MRA (Mean Relative Accuracy) is the mean of the relative accuracy over the sample points in the test data set.

2.6. Error Definition

The experimental results are evaluated using three criteria: the mean absolute error (MAE), the root mean square error (RMSE), and the mean absolute percentage error (MAPE). Here $y_i$ denotes the actual value of PM2.5 concentration at sample point $i$, $\hat{y}_i$ is the corresponding predicted value, and $n$ is the number of sample points. The criteria are computed as follows:
(i) MAE (Mean Absolute Error): MAE is the average of the absolute error, which is defined as follows:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
The smaller the value of MAE, the better the accuracy of the model; MAE directly reflects the actual size of the prediction error.
(ii) RMSE (Root Mean Square Error): RMSE measures the deviation between the predicted value and the actual value and is more sensitive to outliers. RMSE is defined as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}$$
(iii) MAPE (Mean Absolute Percentage Error): MAPE measures the relative error between the predicted value and the actual value. MAPE is defined as follows:
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
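The three error criteria can be computed with a few lines of plain Python. The sketch below uses the standard textbook definitions of MAE, RMSE, and MAPE; the function names and the sample values are illustrative and are not taken from the study's data.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error; penalizes large deviations more than MAE."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error in percent; actual values must be nonzero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up actual and predicted concentrations for three days.
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 44.0]
errors = (mae(actual, predicted), rmse(actual, predicted), mape(actual, predicted))
```

Because MAPE divides by the actual value, it is undefined for days on which the actual concentration is zero; such days would need to be excluded or handled separately.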

3. Results

Using the difference equation model proposed in this paper, the actual concentration, concentration level, and absolute error of PM2.5 in the Bohai Rim Region for the 92 consecutive days from July 1, 2020, to September 30, 2020, are used to verify the real-time prediction performance of the model.

3.1. Prediction Accuracy

After collecting PM2.5 data from the 44 cities in the Bohai Rim Region, we adopted the following prediction process. First, we normalized the data to reduce experimental errors and calculated the daily average PM2.5 concentration and the corresponding concentration level of each cluster. We used the first day’s data to construct the initial data and then used a three-day training data set to predict the PM2.5 concentration level on the fourth day. That is, we used days 1–3, 2–4, 3–5, … as training data, predicted the data on days 4, 5, 6, … accordingly, and recorded the prediction accuracy of all 4 clusters on each of these days. Take the forecasting process for day 4 as an example. The data from the first day are used to build the initial function. Next, we calculated the PM2.5 concentration change data from day 1 to day 3 and used them to estimate the parameters of the model through the lsqcurvefit function in MATLAB. Finally, we used the obtained parameters to predict the data on day 4.
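The rolling three-day scheme can be sketched as follows. To keep the example self-contained, it replaces the full model and the MATLAB `lsqcurvefit` fit with a deliberately simplified stand-in: a single cluster, a logistic difference equation with a fixed carrying capacity, and a closed-form least-squares estimate of one growth rate. The function names and the synthetic series are hypothetical.

```python
def fit_growth_rate(window, capacity):
    """Least-squares growth rate r for u[t+1] - u[t] = r * u[t] * (1 - u[t] / capacity)."""
    numerator = denominator = 0.0
    for now, nxt in zip(window, window[1:]):
        g = now * (1 - now / capacity)
        numerator += g * (nxt - now)
        denominator += g * g
    return numerator / denominator if denominator else 0.0


def rolling_forecast(series, capacity, window=3):
    """Predict day t from the `window` preceding days, for every t that has enough history."""
    predictions = []
    for t in range(window, len(series)):
        train = series[t - window:t]          # e.g. days 1-3 when predicting day 4
        r = fit_growth_rate(train, capacity)
        last = train[-1]
        predictions.append(last + r * last * (1 - last / capacity))
    return predictions


# Synthetic series that follows the logistic rule exactly (r = 0.3, capacity = 6).
series = [1.0]
for _ in range(7):
    series.append(series[-1] + 0.3 * series[-1] * (1 - series[-1] / 6.0))
forecast = rolling_forecast(series, capacity=6.0)
```

Refitting on a short sliding window lets the estimated parameters track changing conditions, at the cost of being sensitive to noise in any single day of the window.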

Figure 6 shows the predicted PM2.5 concentration levels in the four city clusters from July 1, 2020, to September 30, 2020. The figure shows that the difference equation model can effectively predict the PM2.5 concentration level: the prediction curve (represented by the blue line) is roughly the same as the actual curve (represented by the red line). Therefore, the difference equation model in this article provides an accurate estimate of the PM2.5 concentration level, and the predicted trend is basically consistent with the actual trend.

Following the accuracy definitions, we report both the mean relative accuracy and the mean absolute increment accuracy. In this study, the mean absolute increment accuracy measures accuracy with respect to the PM2.5 concentration level range, while the mean relative accuracy measures accuracy with respect to the PM2.5 concentration level value. The prediction accuracy of each city-cluster over the 92 days from July 1, 2020, to September 30, 2020, is shown in Figures 7 and 8. Under the MAIA definition, most of the “+” marks in Figure 7 are located above the horizontal line 0.9, which means that the prediction accuracy on most days in each cluster is higher than 90%, and the average accuracy of each of the 4 clusters is higher than 95%. The “+” marks in Figure 8 are not as dense as those in Figure 7. However, most of the “+” marks in Figure 8 lie above the horizontal line 0.85, which shows that the relative accuracy of each cluster is higher than 80%. Taken together, Figures 7 and 8 show that the MAIA and MRA of all clusters exceed these respective thresholds.

3.2. Error Analysis

Figure 9 shows a line graph of the predicted and actual PM2.5 concentrations of cluster 1 on the right and a histogram of the actual error of cluster 1 on the left. As the line graph in Figure 9 shows, the actual value curve and the predicted value curve coincide closely. According to the error histogram, the proportion of days with an error within 5 units is 46.94%, and the proportion of days with an error within 10 units is 84.78%. This indicates that the model has good predictive performance. The prediction line graphs and error histograms of the remaining clusters are shown in Figure 10. Figure 11 shows the prediction error histogram of the model, which supports the effectiveness and stability of the developed model. It can be seen from Figure 11 that the MAE, RMSE, and MAPE values of each city-cluster are relatively small. In summary, the overall performance of the difference equation model is relatively good: it not only predicts the PM2.5 concentration accurately but also gives stable results.

4. Discussion

This paper adopts a PM2.5 prediction method that combines a difference equation model with a spectral clustering algorithm. Based mainly on an analysis of weather conditions such as wind speed and wind direction in a region, a spectral clustering algorithm is used to divide the region into several city-clusters. By analyzing the relationship between PM2.5 levels in these city-clusters, a specific difference equation model is established to describe the global and local propagation processes of PM2.5 and thereby predict the PM2.5 concentration of each cluster. In our tests, the difference equation model based on the spectral clustering algorithm achieves high prediction accuracy and has strong practical significance. The predicted PM2.5 values are basically the same as the actual observations. The method is practical and can help people plan their travel, but it still has certain defects that need to be improved in application.

Studies have shown that the PM2.5 concentration is also affected by weather factors (such as temperature, humidity, wind speed, and precipitation) and by other air pollutant indicators (such as CO, NO, and SO2). Rainfall in particular has a large impact on PM2.5 concentration. Therefore, as a next step we will consider adding more weather factors and other pollutant index data to improve the prediction accuracy of PM2.5 concentration.

Data Availability

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

XH designed the study and carried out the analysis. XH and CL contributed to writing the paper. CG revised the wording of the article. CL performed numerical simulations. All authors read and approved the final manuscript.

Acknowledgments

This work was supported by the NNSF of China (No. 11561063), Natural Science Foundation of Gansu, China (No. 20JR10RA086), China Postdoctoral Science Foundation (2018M640232), and Natural Science Foundation of Tianjin, China (19JCQNJC14800).