Elsevier

Chemosphere

Volume 265, February 2021, 129140

A practical framework for predicting residential indoor PM2.5 concentration using land-use regression and machine learning methods

https://doi.org/10.1016/j.chemosphere.2020.129140

Highlights

  • A practical indoor PM2.5 prediction framework was proposed.

  • The combined land-use regression and machine learning approach was applied.

  • External validation was proposed to evaluate the prediction model.

  • The random forest model outperformed the linear mixed-effect regression model.

  • The outdoor PM2.5 concentration is the most important predictor variable.

Abstract

People typically spend most of their time indoors. Prediction models that estimate PM2.5 concentration in indoor environments (e.g., residential households) are therefore important for accurate assessments of exposure in epidemiological studies. This study aimed to develop models to predict PM2.5 concentration in residential households. PM2.5 concentration and related parameters (e.g., basic information about the households and ventilation settings) were collected in 116 households during the winter and summer seasons in Hong Kong. Outdoor PM2.5 concentration at each household was estimated using a land-use regression model. The random forest machine learning algorithm was then applied to develop indoor PM2.5 prediction models. The results show that the random forest model achieved promising predictive accuracy, with R2 and cross-validation R2 values of 0.93 and 0.65, respectively. Outdoor PM2.5 concentration was the most important predictor variable, followed in descending order by the household marked number, outdoor temperature, outdoor relative humidity, average household area and air conditioning. External validation using an independent dataset confirmed the potential application of the random forest model, with an R2 value of 0.47. Overall, this study shows the value of a combined land-use regression and machine learning approach in establishing indoor PM2.5 prediction models that provide a relatively accurate assessment of exposure for use in epidemiological studies.

Introduction

Air pollution has become a global public health issue (Kioumourtzoglou et al., 2016; Li et al., 2020a; Weber et al., 2016; Yim et al., 2015; Yim, 2020). Particulate matter with an aerodynamic diameter ≤2.5 μm (PM2.5) has been reported to be positively associated with adverse effects on respiratory and cardiovascular health (Hopke et al., 2020; Kioumourtzoglou et al., 2016; Liu et al., 2018; Weber et al., 2016), causing a substantial amount of premature mortality every year (Gu and Yim, 2016; Hou et al., 2019; Yim et al., 2019a). People spend an average of 80%–90% of their time indoors (Ren et al., 2017; Rivas et al., 2019; Xie et al., 2020). It is therefore important to determine indoor PM2.5 concentration to evaluate personal exposure to PM2.5 and protect human health (Faria et al., 2020; Han et al., 2015; Yuchi et al., 2019). Measuring indoor PM2.5 concentration with strict quality assurance/control procedures provides accurate data for air quality and public health studies (Faria et al., 2020; Han et al., 2015; Zhou et al., 2016), but is both expensive and time-consuming. In addition, it is almost impossible to conduct indoor PM2.5 measurements for a large number of subjects (e.g., a few thousand people or more) or for a long study period (e.g., over years) (Tong et al., 2019; Yuchi et al., 2019).

Indoor PM2.5 originates from the infiltration of ambient PM2.5 and emissions from indoor sources (e.g., cooking and smoking) (Ji and Zhao, 2015; Kalimeri et al., 2019; Tong et al., 2018; Zhao et al., 2015). Indoor PM2.5 concentration can therefore be predicted from the measured or modeled concentration of ambient PM2.5, factors describing the infiltration of ambient PM2.5 and factors describing the strength of indoor PM2.5 emission sources (Elbayoumi et al., 2014; Tong et al., 2019). Indoor PM2.5 prediction models can be divided into two categories: physically based mechanistic models and statistical models (Wei et al., 2019). Mechanistic models are typically difficult to establish because they require detailed and complex input data, such as the exact strength of indoor sources and information on sink materials for particles (Wei et al., 2019; Xie et al., 2020). Statistical models can achieve relatively good prediction performance with smaller amounts of input data.
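To make the two contributions described above concrete, the following is a minimal sketch of a textbook single-zone, steady-state mass balance, one of the simplest mechanistic formulations. It is not the model used in this study. The function name, parameter names and example values are illustrative assumptions only.

```python
def steady_state_indoor_pm25(c_out, penetration, air_exchange,
                             deposition, source_rate, volume):
    """Split steady-state indoor PM2.5 (ug/m3) into its two origins.

    Assumed single-zone, well-mixed model (not taken from the paper):
        C_in = P*a/(a + k) * C_out + S / (V * (a + k))

    c_out        outdoor PM2.5 concentration, ug/m3
    penetration  penetration efficiency P, dimensionless (0 to 1)
    air_exchange air exchange rate a, 1/h
    deposition   particle deposition loss rate k, 1/h
    source_rate  indoor emission rate S, ug/h
    volume       room volume V, m3
    """
    loss_rate = air_exchange + deposition
    if loss_rate <= 0 or volume <= 0:
        raise ValueError("loss rate and volume must be positive")
    # Fraction of outdoor PM2.5 that remains suspended indoors.
    infiltration_factor = penetration * air_exchange / loss_rate
    outdoor_origin = infiltration_factor * c_out
    # Indoor emissions diluted by ventilation and removed by deposition.
    indoor_origin = source_rate / (volume * loss_rate)
    return outdoor_origin, indoor_origin


# Hypothetical household: the values below are made up for illustration.
outdoor_origin, indoor_origin = steady_state_indoor_pm25(
    c_out=30.0, penetration=0.8, air_exchange=1.0,
    deposition=0.2, source_rate=1200.0, volume=100.0)
print(round(outdoor_origin, 2), round(indoor_origin, 2))
```

Even this simple form needs the penetration efficiency, deposition rate and source strength for each household, which illustrates why mechanistic models are demanding in terms of input data.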

A number of previous studies have successfully developed statistical models to predict indoor PM2.5 concentration (Elbayoumi et al., 2014; Tong et al., 2019; Yuchi et al., 2019). Most of these studies assumed that indoor PM2.5 concentration is linearly correlated with the predictor variables. Such models may not fully capture the complex relationship between predictor variables and indoor PM2.5 concentration, resulting in biases of predictions against observations (Yuchi et al., 2019). In recent years, the machine learning approach has been applied to the prediction of indoor PM2.5 concentration (Elbayoumi et al., 2015; Yuchi et al., 2019). For example, Elbayoumi et al. (2015) reported that a machine learning model based on a feed-forward, back-propagation neural network outperformed a multiple linear regression (MLR) method in predicting seasonal indoor PM2.5 concentration in naturally ventilated school classrooms in the Gaza Strip. Yuchi et al. (2019) modeled indoor PM2.5 concentration in apartments in Ulaanbaatar, Mongolia using a random forest machine learning method and found that their model performed better than a conventional MLR model in terms of prediction, but had a similar performance to MLR in cross-validation. Most of these indoor PM2.5 prediction studies used outdoor PM2.5 measurements or nearby fixed-site station PM2.5 measurements as predictor variables (Elbayoumi et al., 2014; Yuchi et al., 2019). As a result, the established models had poor transferability to other areas, because outdoor PM2.5 measurements require considerable effort (Han et al., 2015; Qi et al., 2017) and the use of nearby station measurements may introduce a large bias due to the spatial variability in outdoor PM2.5 concentration (Che et al., 2018). It is therefore necessary to apply a modelling approach to estimate outdoor PM2.5 concentration for use in indoor predictions. In addition, reported studies have only used the cross-validation method to evaluate established prediction models, which may not be an adequate test because the training and validation subsets come from the same dataset and are therefore similar (Li et al., 2020b). It is therefore essential to assess the models using independent datasets.

Tong et al. (2019) used linear mixed-effect regression (LMR) to predict indoor PM2.5 concentration in 116 households in Hong Kong, with a moderate R2 value of 0.67. The main objective of the present work was to develop a residential indoor PM2.5 prediction model with the same dataset using a combined land-use regression and machine learning approach and to compare the results with the LMR-based predictions. The established machine learning model was evaluated using ten-fold cross-validation and independent external validation.
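The difference between training, ten-fold cross-validation and external validation R2 can be illustrated with a short sketch. To stay dependency-free it uses a one-predictor least-squares line as a stand-in for the random forest, and the data are synthetic. All names and values below are assumptions for illustration, not the authors' code or data.

```python
import random


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns a predict function."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda xs: [intercept + slope * xi for xi in xs]


def k_fold_r_squared(x, y, fit, k=10, seed=0):
    """Score pooled out-of-fold predictions from k-fold cross-validation."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    predictions = [None] * len(x)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Refit on the other k-1 folds, then predict the held-out fold.
        predict = fit([x[i] for i in train], [y[i] for i in train])
        for i, p in zip(held_out, predict([x[i] for i in held_out])):
            predictions[i] = p
    return r_squared(y, predictions)


# Synthetic outdoor/indoor pairs (made-up relationship and noise level).
rng = random.Random(42)
outdoor = [rng.uniform(10, 60) for _ in range(120)]
indoor = [5 + 0.6 * o + rng.gauss(0, 5) for o in outdoor]

# First 90 samples build the model; the last 90-119 play the role of an
# independent external dataset that the model never sees during fitting.
train_x, train_y = outdoor[:90], indoor[:90]
ext_x, ext_y = outdoor[90:], indoor[90:]

model = fit_line(train_x, train_y)
train_r2 = r_squared(train_y, model(train_x))
cv_r2 = k_fold_r_squared(train_x, train_y, fit_line, k=10)
ext_r2 = r_squared(ext_y, model(ext_x))
print(f"training R2={train_r2:.2f}  10-fold CV R2={cv_r2:.2f}  "
      f"external R2={ext_r2:.2f}")
```

In this synthetic example the external samples come from the same distribution as the training samples, so the three scores will be close. With real data collected in a separate campaign the external R2 can be noticeably lower, as the values reported in the abstract suggest.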

Section snippets

Indoor PM2.5 measurements

Indoor PM2.5 concentration was measured in 116 households under the Mr. and Ms. Os (Hong Kong) Cohort Study. Fig. 1(a) shows the distribution of households sampled in the campaign. The sampling campaign was carried out in two phases. In the first phase, 55 households participated in the summer session from July 4, 2016 to September 29, 2016. Eight households withdrew after the summer session and therefore only 47 households participated in the winter session from November 14, 2016 to March 6,

The land-use regression model

The established land-use regression model achieved remarkable predictive accuracy, with training R2 and leave-one-out cross-validation R2 values of 0.79 and 0.64, respectively. Fig. 2 shows the spatial distribution of land-use regression estimated PM2.5 concentration at residential households in summer and winter seasons of the two sampling phases, respectively. On average, the outdoor PM2.5 concentration was larger in winter than in summer (Fig. 2). The land-use regression PM2.5 estimates had

Conclusions

A sampling campaign was conducted to measure indoor PM2.5 concentration in households in Hong Kong, with the collection of detailed information about indoor emission sources and ventilation settings. Indoor PM2.5 prediction models were established using the combined land-use regression and random forest machine learning approach. The training, cross-validation and external validation R2 values confirm that the established random forest model had a higher predictive accuracy than the

Credit author statement

Zhiyuan Li: Conceptualization, Methodology, Data curation, Formal analysis, Writing – original draft, Writing – review & editing, Visualization; Xinning Tong: Data curation, Writing – review & editing; Jason Man Wai Ho: Writing – review & editing; Timothy C.Y. Kwok: Writing – review & editing; Guanghui Dong: Writing – review & editing; Kin-Fai Ho: Conceptualization, Methodology, Funding acquisition, Resources, Supervision, Writing – review & editing; Steve Hung Lam Yim: Conceptualization,

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is jointly funded by the Vice-Chancellor’s Discretionary Fund of The Chinese University of Hong Kong (grant no. 4930744) and Dr. Stanley Ho Medicine Development Foundation (grant no. 8305509). We would like to thank the Hong Kong Environmental Protection Department and the Hong Kong Observatory for providing air quality and meteorological data, respectively.

References (57)

  • Y. Gu et al.

    The air quality and health impacts of domestic trans-boundary pollution in various regions of China

    Environ. Int.

    (2016)
  • Y. Han et al.

    Influences of ambient air PM2.5 concentration and meteorological condition on the indoor PM2.5 concentrations in a residential apartment in Beijing using a new approach

    Environ. Pollut.

    (2015)
  • O.O. Hänninen et al.

    Infiltration of ambient PM2.5 and levels of indoor generated non-ETS PM2.5 in residences of four European cities

    Atmos. Environ.

    (2004)
  • P.K. Hopke et al.

    Changes in the hospitalization and ED visit rates for respiratory diseases associated with source-specific PM2.5 in New York State from 2005 to 2016

    Environ. Res.

    (2020)
  • C. Hutengs et al.

    Downscaling land surface temperatures at regional scales with random forest regression

    Remote Sens. Environ.

    (2016)
  • W. Ji et al.

    Contribution of outdoor-originating particles, indoor-emitted particles and indoor secondary organic aerosol (SOA) to residential indoor PM2.5 concentration: a model-based estimation

    Build. Environ.

    (2015)
  • K.K. Kalimeri et al.

    Investigation of the PM2.5, NO2 and O3 I/O ratios for office and school microenvironments

    Environ. Res.

    (2019)
  • H.K. Lai et al.

    Determinants of indoor air concentrations of PM2.5, black smoke and NO2 in six European cities (EXPOLIS study)

    Atmos. Environ.

    (2006)
  • W.C. Lee et al.

    Effects of future temperature change on PM2.5 infiltration in the Greater Boston area

    Atmos. Environ.

    (2017)
  • Z.Y. Li et al.

    Characterization of PM2.5 exposure concentration in transport microenvironments using portable monitors

    Environ. Pollut.

    (2017)
  • Z.Y. Li et al.

    A feasible experimental framework for field calibration of portable light-scattering aerosol monitors: case of TSI DustTrak

    Environ. Pollut.

    (2019)
  • Z.Y. Li et al.

    High temporal resolution prediction of street-level PM2.5 and NOx concentrations using machine learning approach

    J. Clean. Prod.

    (2020)
  • H. Liu et al.

    Fine particulate air pollution and hospital admissions and readmissions for acute myocardial infarction in 26 Chinese cities

    Chemosphere

    (2018)
  • Q.Y. Meng et al.

    Determinants of indoor and personal exposure to PM2.5 of indoor and outdoor origin during the RIOPA study

    Atmos. Environ.

    (2009)
  • M. Qi et al.

    Exposure and health impact evaluation based on simultaneous measurement of indoor and ambient PM2.5 in Haidian, Beijing

    Environ. Pollut.

    (2017)
  • J. Ren et al.

    Influencing factors and energy-saving control strategies for indoor fine particles in commercial office buildings in six Chinese cities

    Energy Build.

    (2017)
  • M. Scibor

    Are we safe inside? Indoor air quality in relation to outdoor concentration of PM10 and PM2.5 and to characteristics of homes

    Sustain. Cities Soc.

    (2019)
  • Z. Shao et al.

    Seasonal trends of indoor fine particulate matter and its determinants in urban residences in Nanjing, China

    Build. Environ.

    (2017)