A practical framework for predicting residential indoor PM2.5 concentration using land-use regression and machine learning methods
Introduction
Air pollution has become a global public health issue (Kioumourtzoglou et al., 2016; Li et al., 2020a; Weber et al., 2016; Yim et al., 2015; Yim, 2020). Particulate matter with an aerodynamic diameter ≤2.5 μm (PM2.5) has been reported to be positively associated with adverse effects on respiratory and cardiovascular health (Hopke et al., 2020; Kioumourtzoglou et al., 2016; Liu et al., 2018; Weber et al., 2016), causing a substantial number of premature deaths every year (Gu and Yim, 2016; Hou et al., 2019; Yim et al., 2019a). People spend on average 80%–90% of their time indoors (Ren et al., 2017; Rivas et al., 2019; Xie et al., 2020). It is therefore important to determine indoor PM2.5 concentration in order to evaluate personal exposure to PM2.5 and protect human health (Faria et al., 2020; Han et al., 2015; Yuchi et al., 2019). Measuring indoor PM2.5 concentration with strict quality assurance/control procedures provides accurate data for air quality and public health studies (Faria et al., 2020; Han et al., 2015; Zhou et al., 2016), but is both expensive and time consuming. In addition, it is almost impossible to conduct indoor PM2.5 measurements for a large number of subjects (e.g., a few thousand people or more) or over a long study period (e.g., several years) (Tong et al., 2019; Yuchi et al., 2019).
Indoor PM2.5 originates from the infiltration of ambient PM2.5 and from emissions from indoor sources (e.g., cooking and smoking) (Ji and Zhao, 2015; Kalimeri et al., 2019; Tong et al., 2018; Zhao et al., 2015). Indoor PM2.5 concentration can therefore be predicted from the measured or modeled concentration of ambient PM2.5, the factors that govern the infiltration of ambient PM2.5, and the factors that describe the strength of indoor PM2.5 emission sources (Elbayoumi et al., 2014; Tong et al., 2019). Indoor PM2.5 prediction models can be divided into two categories: physically based mechanistic models and statistical models (Wei et al., 2019). Mechanistic models are typically difficult to establish because they require detailed and complex input data, such as the exact strength of indoor sources and information on sink materials for particles (Wei et al., 2019; Xie et al., 2020). Statistical models can achieve relatively good prediction performance with smaller amounts of input data.
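As a minimal illustration of the statistical approach, the sketch below regresses indoor concentration on outdoor concentration with ordinary least squares. Under the common simplifying assumption that indoor PM2.5 is the sum of an infiltrated ambient fraction and an indoor-source contribution, the slope can be read as an infiltration factor and the intercept as the average indoor-source contribution. This is not the model used in the present work; the function name `fit_line` and all concentration values are hypothetical and invented for illustration only.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical paired concentrations in ug/m3 (invented, not measured data).
outdoor = [10.0, 20.0, 30.0, 40.0, 50.0]
indoor = [11.0, 17.0, 23.0, 29.0, 35.0]

# Assumed interpretation: slope ~ infiltration factor,
# intercept ~ mean contribution of indoor sources.
slope, intercept = fit_line(outdoor, indoor)
print(f"infiltration factor ~ {slope:.2f}, indoor-source term ~ {intercept:.1f} ug/m3")
```

A single-predictor line is the simplest case of the linear models discussed in this section; real applications add further predictors such as ventilation settings and indoor activities.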
A number of previous studies have successfully developed statistical models to predict indoor PM2.5 concentration (Elbayoumi et al., 2014; Tong et al., 2019; Yuchi et al., 2019). Most of these studies assumed that indoor PM2.5 concentration is linearly correlated with the predictor variables. Such models may not fully capture the complex relationship between predictor variables and indoor PM2.5 concentration, resulting in biased predictions relative to observations (Yuchi et al., 2019). In recent years, machine learning has been applied to the prediction of indoor PM2.5 concentration (Elbayoumi et al., 2015; Yuchi et al., 2019). For example, Elbayoumi et al. (2015) reported that a machine learning model based on a feed-forward, back-propagation neural network outperformed a multiple linear regression (MLR) method in predicting seasonal indoor PM2.5 concentration in naturally ventilated school classrooms in the Gaza Strip. Yuchi et al. (2019) modeled indoor PM2.5 concentration in apartments in Ulaanbaatar, Mongolia using a random forest machine learning method and found that their model performed better than a conventional MLR model in terms of prediction, but had a similar performance to MLR in cross-validation. Most of these indoor PM2.5 prediction studies used outdoor PM2.5 measurements or nearby fixed-site station PM2.5 measurements as predictor variables (Elbayoumi et al., 2014; Yuchi et al., 2019). As a result, the established models had poor transferability to other areas, because outdoor PM2.5 measurements require considerable effort (Han et al., 2015; Qi et al., 2017) and the use of nearby station measurements may introduce a large bias due to the spatial variability in outdoor PM2.5 concentration (Che et al., 2018). A modelling approach is therefore needed to estimate outdoor PM2.5 concentration for use in indoor predictions.
In addition, reported studies have used only cross-validation to evaluate the established prediction models, which may not be adequate because the training and validation subsets are drawn from the same dataset and are therefore similar (Li et al., 2020b). It is therefore essential to assess the models using independent datasets.
Tong et al. (2019) used linear mixed-effect regression (LMR) to predict indoor PM2.5 concentration in 116 households in Hong Kong, with a moderate R2 value of 0.67. The main objective of the present work was to develop a residential indoor PM2.5 prediction model from the same dataset using a combined land-use regression and machine learning approach, and to compare the results with the LMR-based predictions. The established machine learning model was evaluated using ten-fold cross-validation and independent external validation.
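The difference between the two evaluation methods can be sketched as follows. The example computes a training R2, a ten-fold cross-validation R2 (each fold is predicted by a model fitted on the other nine), and an external validation R2 on a separate dataset that was never used for fitting. To keep the sketch self-contained, a one-predictor linear model stands in for the random forest model, and both datasets are synthetic; the helper names, the random seeds and the simulated relationship are assumptions made for illustration and do not come from the study.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - y_bar) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_r2(x, y, k=10):
    """R2 computed from pooled out-of-fold predictions."""
    predicted = [0.0] * len(x)
    for fold in k_fold_indices(len(x), k):
        held_out = set(fold)
        train = [i for i in range(len(x)) if i not in held_out]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in fold:
            predicted[i] = slope * x[i] + intercept
    return r_squared(y, predicted)


# Synthetic data (hypothetical relationship, not the study's measurements).
rng = random.Random(42)


def simulate(n):
    x = [rng.uniform(5.0, 60.0) for _ in range(n)]
    y = [0.6 * xi + 5.0 + rng.gauss(0.0, 3.0) for xi in x]
    return x, y


x_dev, y_dev = simulate(100)  # development set: fitting and cross-validation
x_ext, y_ext = simulate(30)   # independent set: external validation only

slope, intercept = fit_line(x_dev, y_dev)
train_r2 = r_squared(y_dev, [slope * xi + intercept for xi in x_dev])
cv_r2 = cross_validated_r2(x_dev, y_dev, k=10)
ext_r2 = r_squared(y_ext, [slope * xi + intercept for xi in x_ext])
print(f"training R2 = {train_r2:.2f}, 10-fold CV R2 = {cv_r2:.2f}, external R2 = {ext_r2:.2f}")
```

In this sketch the external set is simulated from the same process as the development set, so the three values will be close. With real data collected at different sites or times, the external validation R2 is the more demanding test of transferability.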
Indoor PM2.5 measurements
Indoor PM2.5 concentration was measured in 116 households under the Mr. and Ms. Os (Hong Kong) Cohort Study. Fig. 1(a) shows the distribution of households sampled in the campaign. The sampling campaign was carried out in two phases. In the first phase, 55 households participated in the summer session from July 4, 2016 to September 29, 2016. Eight households withdrew after the summer session and therefore only 47 households participated in the winter session from November 14, 2016 to March 6,
The land-use regression model
The established land-use regression model performed well, with training R2 and leave-one-out cross-validation R2 values of 0.79 and 0.64, respectively. Fig. 2 shows the spatial distribution of PM2.5 concentration estimated by land-use regression at residential households in the summer and winter seasons of the two sampling phases. On average, the outdoor PM2.5 concentration was higher in winter than in summer (Fig. 2). The land-use regression PM2.5 estimates had
Conclusions
A sampling campaign was conducted to measure indoor PM2.5 concentration in households in Hong Kong, with the collection of detailed information about indoor emission sources and ventilation settings. Indoor PM2.5 prediction models were established using the combined land-use regression and random forest machine learning approach. The training, cross-validation and external validation R2 values confirm that the established random forest model had a higher predictive accuracy than the
Credit author statement
Zhiyuan Li: Conceptualization, Methodology, Data curation, Formal analysis, Writing – original draft, Writing – review & editing, Visualization. Xinning Tong: Data curation, Writing – review & editing. Jason Man Wai Ho: Writing – review & editing. Timothy C.Y. Kwok: Writing – review & editing. Guanghui Dong: Writing – review & editing. Kin-Fai Ho: Conceptualization, Methodology, Funding acquisition, Resources, Supervision, Writing – review & editing. Steve Hung Lam Yim: Conceptualization,
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work is jointly funded by the Vice-Chancellor’s Discretionary Fund of The Chinese University of Hong Kong (grant no. 4930744) and Dr. Stanley Ho Medicine Development Foundation (grant no. 8305509). We would like to thank the Hong Kong Environmental Protection Department and the Hong Kong Observatory for providing air quality and meteorological data, respectively.
References (57)
- et al. The effect of outdoor air and indoor human activity on mass concentrations of PM10, PM2.5, and PM1 in a classroom. Environ. Res. (2005)
- et al. A methodology for predicting particle penetration factor through cracks of windows and doors for actual engineering application. Build. Environ. (2012)
- et al. A comparison of linear regression, regularization, and machine learning algorithms to develop Europe-wide spatial models of fine particles and nitrogen dioxide. Environ. Int. (2019)
- et al. Impact of outdoor meteorology on indoor PM10, PM2.5 and PM1 concentrations in a naturally ventilated classroom. Urban Clim. (2014)
- et al. Development and comparison of regression models and feedforward backpropagation neural network models to predict seasonal indoor PM2.5–10 and PM2.5 concentrations in naturally ventilated schools. Atmos. Pollut. Res. (2015)
- et al. Multivariate methods for indoor PM10 and PM2.5 modelling in naturally ventilated schools buildings. Atmos. Environ. (2014)
- et al. Children’s exposure and dose assessment to particulate matter in Lisbon. Build. Environ. (2020)
- et al. Importance of sample size, data type and prediction method for remote sensing-based estimations of aboveground forest biomass. Remote Sens. Environ. (2014)
- et al. Analysing the impact of multiple stressors in aquatic biomonitoring data: a ‘cookbook’ with applications in R. Sci. Total Environ. (2016)
- et al. A multi-city air pollution population exposure study: combined use of chemical-transport and random-Forest models with dynamic population data. Sci. Total Environ. (2020)
- The air quality and health impacts of domestic trans-boundary pollution in various regions of China. Environ. Int.
- Influences of ambient air PM2.5 concentration and meteorological condition on the indoor PM2.5 concentrations in a residential apartment in Beijing using a new approach. Environ. Pollut.
- Infiltration of ambient PM2.5 and levels of indoor generated non-ETS PM2.5 in residences of four European cities. Atmos. Environ.
- Changes in the hospitalization and ED visit rates for respiratory diseases associated with source-specific PM2.5 in New York State from 2005 to 2016. Environ. Res.
- Downscaling land surface temperatures at regional scales with random forest regression. Remote Sens. Environ.
- Contribution of outdoor-originating particles, indoor-emitted particles and indoor secondary organic aerosol (SOA) to residential indoor PM2.5 concentration: a model-based estimation. Build. Environ.
- Investigation of the PM2.5, NO2 and O3 I/O ratios for office and school microenvironments. Environ. Res.
- Determinants of indoor air concentrations of PM2.5, black smoke and NO2 in six European cities (EXPOLIS study). Atmos. Environ.
- Effects of future temperature change on PM2.5 infiltration in the Greater Boston area. Atmos. Environ.
- Characterization of PM2.5 exposure concentration in transport microenvironments using portable monitors. Environ. Pollut.
- A feasible experimental framework for field calibration of portable light-scattering aerosol monitors: case of TSI DustTrak. Environ. Pollut.
- High temporal resolution prediction of street-level PM2.5 and NOx concentrations using machine learning approach. J. Clean. Prod.
- Fine particulate air pollution and hospital admissions and readmissions for acute myocardial infarction in 26 Chinese cities. Chemosphere
- Determinants of indoor and personal exposure to PM2.5 of indoor and outdoor origin during the RIOPA study. Atmos. Environ.
- Exposure and health impact evaluation based on simultaneous measurement of indoor and ambient PM2.5 in Haidian, Beijing. Environ. Pollut.
- Influencing factors and energy-saving control strategies for indoor fine particles in commercial office buildings in six Chinese cities. Energy Build.
- Are we safe inside? Indoor air quality in relation to outdoor concentration of PM10 and PM2.5 and to characteristics of homes. Sustain. Cities Soc.
- Seasonal trends of indoor fine particulate matter and its determinants in urban residences in Nanjing, China. Build. Environ.