Predicting Groundwater PFOA Exposure Risks with Bayesian Networks: Empirical Impact of Data Preprocessing on Model Performance.
Runwei Li, Jacqueline MacDonald Gibson. Published in: Environmental Science & Technology (2023)
The plethora of data on PFASs in environmental samples collected in response to growing concern about these chemicals could enable the training of machine-learning models for predicting exposure risks. However, differences in sampling and analysis methods across data sets must be reconciled through data preprocessing, and little information is available about how such manipulations affect the resulting models. This study evaluates how data preprocessing influences machine-learned Bayesian network models of PFOA in groundwater. We link 19 years of PFOA measurements from Minnesota, USA, to publicly available information about potential PFOA sources and factors that may influence their environmental fate. Nine different preprocessing methods were tested, and the resulting data sets were used to train models to predict the probability of PFOA ≥ 35 ppt, the 2017 Minnesota health advisory level. Different preprocessing approaches produced varying model structures with significantly different accuracies. Nonetheless, models showed similar relationships between predictor variables and PFOA exposure risks, and all models were relatively accurate, distinguishing wells at high risk from those at low risk for 82.0% to 89.0% of test data samples. There was a trade-off between data quality and model performance since a stricter data screening strategy decreased the sample size for model training.
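The trade-off between screening strictness and sample size can be illustrated with a minimal sketch. Only the 35 ppt threshold comes from the abstract. The sample records, the field names, and the reporting-limit screening rule are hypothetical and do not represent any of the nine preprocessing methods tested in the study.

```python
# Hypothetical illustration only: the records, field names, and the
# reporting-limit screening rule are assumptions, not the study's data or methods.

ADVISORY_PPT = 35.0  # 2017 Minnesota health advisory level for PFOA (from the abstract)


def label_exceedance(samples, max_reporting_limit=None):
    """Screen samples, then label each kept sample as at/above the advisory level.

    max_reporting_limit is an assumed quality screen: samples whose laboratory
    reporting limit exceeds it are dropped. None keeps every sample.
    """
    kept = []
    for s in samples:
        if max_reporting_limit is not None and s["reporting_limit_ppt"] > max_reporting_limit:
            continue  # stricter screening discards lower-quality measurements
        kept.append(s)
    labels = [s["pfoa_ppt"] >= ADVISORY_PPT for s in kept]
    return kept, labels


# Made-up measurements for four hypothetical wells.
samples = [
    {"well": "A", "pfoa_ppt": 120.0, "reporting_limit_ppt": 5.0},
    {"well": "B", "pfoa_ppt": 35.0, "reporting_limit_ppt": 50.0},
    {"well": "C", "pfoa_ppt": 4.0, "reporting_limit_ppt": 2.0},
    {"well": "D", "pfoa_ppt": 60.0, "reporting_limit_ppt": 20.0},
]

lenient_kept, lenient_labels = label_exceedance(samples)
strict_kept, strict_labels = label_exceedance(samples, max_reporting_limit=10.0)

print(len(lenient_kept), lenient_labels)
print(len(strict_kept), strict_labels)
```

In this toy data the stricter screen halves the number of training samples, which mirrors in miniature the data quality versus sample size tension the abstract describes.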