Automating the interpretation of PM2.5 time-resolved measurements using a data-driven approach.
Hao Tang, Wanyu Rengie Chan, Michael D Sohn
Published in: Indoor Air (2020)
The rapid development of automated measurement equipment enables researchers to collect greater quantities of time-resolved data from indoor and outdoor environments. Although these data are valuable, interpreting them can be time-consuming. This paper introduces an automated process for interpreting time-resolved PM2.5 data and for differentiating PM2.5 emissions from indoor and outdoor sources. We use Random Forest (RF), a machine learning approach, to study a dataset of 836 indoor emission events that occurred over a 2-week period in 18 apartments in California. We describe the development of the model and evaluate how its performance varies with sample size and data source. We discuss which characteristics of the dataset tended to help source identification, and why. For example, we show that data from many events and from different apartments are essential for the model to be suitable for analyzing a new, separate dataset. We also show that longitudinal data appear to be more helpful than a higher measurement frequency within a given apartment. We use the resulting RF model to analyze PM2.5 data from an entirely separate dataset collected from 65 new homes in California. The RF model identifies 442 indoor emission events, with only a few misidentifications.
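The finding that events from different apartments matter for generalizing to a new dataset suggests evaluating a model on apartments it has not seen. The following is a minimal sketch of such a leave-one-apartment-out split, not the authors' procedure. The function name, the event layout, the feature values, and the labels are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def leave_one_apartment_out(events):
    """Yield (held_out_id, train, test), holding out one apartment at a time.

    `events` is a list of (apartment_id, features, label) tuples. This layout
    is an assumption made for the sketch, not the format used in the paper.
    """
    by_apartment = defaultdict(list)
    for event in events:
        by_apartment[event[0]].append(event)

    for held_out in sorted(by_apartment):
        test = by_apartment[held_out]
        # Training data comes only from the other apartments.
        train = [
            event
            for apartment, group in by_apartment.items()
            if apartment != held_out
            for event in group
        ]
        yield held_out, train, test


# Made-up toy events: (apartment_id, [peak_concentration, duration_min], label)
events = [
    ("A", [35.0, 20], "indoor"),
    ("A", [12.0, 180], "outdoor"),
    ("B", [80.0, 15], "indoor"),
    ("C", [9.0, 240], "outdoor"),
    ("C", [50.0, 30], "indoor"),
]

splits = list(leave_one_apartment_out(events))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Each split keeps every event from the held-out apartment out of the training set, so a classifier scored on it is tested on a home it has never seen. Any classifier, including a Random Forest, could be fitted on `train` and scored on `test` within the loop.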