Comparing Methods to Impute Missing Daily Ground-Level PM10 Concentrations between 2010-2017 in South Africa.

Oluwaseyi Olalekan Arowosegbe, Martina S Ragettli, Nino Künzli, Apolline Saucy, Temitope Christina Adebayo-Ojo, Mohamed Fareed Jeebhay, Mohammed Aqiel Dalvie, Kees de Hoogh
Published in: International Journal of Environmental Research and Public Health (2021)
Good quality and completeness of ambient air quality monitoring data are central to supporting actions that mitigate the impact of ambient air pollution. In South Africa, however, continuous ground-level air pollution monitoring data are scarce and incomplete. To address this issue, we developed and compared different modeling approaches to impute missing daily average particulate matter (PM10) data between 2010 and 2017 using spatiotemporal predictor variables. The random forest (RF) machine learning method was used to explore the relationship between average daily PM10 concentrations and spatiotemporal predictors such as meteorological, land use and source-related variables. National (8 models), provincial (32) and site-specific (44) RF models were developed to impute missing daily PM10 data. The annual national, provincial and site-specific RF cross-validation (CV) models explained on average 78%, 70% and 55% of the variability in ground-level PM10 concentrations, respectively. The spatial components of the national and provincial CV RF models explained on average 22% and 48%, while the temporal components of the national, provincial and site-specific CV RF models explained on average 78%, 68% and 57% of the variability in ground-level PM10 concentrations, respectively. This study demonstrates a feasible RF-based approach to imputing missing measurement data in areas where data collection is sparse and incomplete.
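The abstract reports overall, spatial and temporal CV performance but does not say how the components were computed. The sketch below illustrates one common convention, which is an assumption here and not necessarily the authors' procedure: overall R² compares all held-out observations with their predictions, spatial R² compares per-site means, and temporal R² compares each day's deviation from its site mean. The function names and the toy records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def r_squared(observed, predicted):
    """Squared Pearson correlation between two equal-length sequences."""
    mo, mp = mean(observed), mean(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return float("nan")  # undefined when either series is constant
    return cov ** 2 / (var_o * var_p)


def cv_r2_components(records):
    """Split CV performance into overall, spatial and temporal R².

    records: iterable of (site_id, observed, predicted) tuples, where each
    prediction comes from a model that did not see that observation.
    This decomposition is an assumed convention, not taken from the paper.
    """
    records = list(records)
    by_site = defaultdict(list)
    for site, obs, pred in records:
        by_site[site].append((obs, pred))

    # Per-site means carry the spatial (between-site) signal.
    site_obs_mean = {s: mean(o for o, _ in v) for s, v in by_site.items()}
    site_pred_mean = {s: mean(p for _, p in v) for s, v in by_site.items()}
    sites = sorted(by_site)

    overall = r_squared(
        [o for _, o, _ in records],
        [p for _, _, p in records],
    )
    spatial = r_squared(
        [site_obs_mean[s] for s in sites],
        [site_pred_mean[s] for s in sites],
    )
    # Deviations from the site mean carry the temporal (within-site) signal.
    temporal = r_squared(
        [o - site_obs_mean[s] for s, o, _ in records],
        [p - site_pred_mean[s] for s, _, p in records],
    )
    return {"overall": overall, "spatial": spatial, "temporal": temporal}


# Toy held-out records (made-up values, for illustration only).
toy_records = [
    ("A", 10, 12), ("A", 20, 22), ("A", 30, 32),
    ("B", 40, 38), ("B", 50, 48), ("B", 60, 58),
    ("C", 15, 15), ("C", 25, 25), ("C", 35, 35),
]
result = cv_r2_components(toy_records)
print(result)
```

In the toy data every site's predictions track day-to-day changes exactly but carry a site-specific offset, so the temporal component is high while the spatial component is lower. For a site-specific model the spatial component is undefined, since there is only one site mean, which is consistent with the abstract reporting spatial components only for the national and provincial models.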
Keyphrases
  • air pollution
  • particulate matter
  • electronic health record
  • lung function
  • south africa
  • big data
  • machine learning
  • quality improvement
  • physical activity
  • heavy metals
  • risk assessment
  • deep learning
  • data analysis