RandomForestsGLS: An R package for Random Forests for dependent data.

Arkajyoti Saha, Sumanta Basu, Abhirup Datta
Published in: Journal of Open Source Software (2022)
With modern advances in geographical information systems, remote sensing technologies, and low-cost sensors, we increasingly encounter datasets in which spatial or serial dependence must be accounted for. Dependent observations $(y_1, y_2, \ldots, y_n)$ with covariates $(x_1, \ldots, x_n)$ can be modeled non-parametrically as $y_i = m(x_i) + \epsilon_i$, where $m(x_i)$ is the mean component and $\epsilon_i$ accounts for the dependence in the data. We assume that the dependence is captured through the covariance function of the correlated stochastic process $\epsilon_i$ (second-order dependence). The correlation is typically a function of the "spatial distance" or "time lag" between two observations. Unlike linear regression, non-linear machine learning (ML) methods for estimating the regression function $m$ can capture complex interactions among the variables. However, they often fail to account for the dependence structure, resulting in sub-optimal estimation. On the other hand, specialized software for spatial/temporal data properly models data correlation but lacks flexibility in modeling the mean function $m$ because it focuses only on linear models. RandomForestsGLS bridges this gap through a novel rendition of Random Forests (RF), namely RF-GLS, which explicitly models the spatial/serial data correlation in the RF fitting procedure to substantially improve estimation of the mean function. Additionally, RandomForestsGLS leverages kriging to perform predictions at new locations for geo-spatial data.
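The model in the abstract can be made concrete with a small simulation. The sketch below is written in plain Python rather than R and does not use or reproduce the RandomForestsGLS API. It draws data from $y_i = m(x_i) + \epsilon_i$ with errors correlated through an exponential covariance in distance, then evaluates a generalized least squares (GLS) quadratic form, the kind of correlation-aware loss that the "GLS" in RF-GLS refers to. The mean function, the covariance parameters `sigma2` and `phi`, and all function names are assumptions chosen for illustration.

```python
import math
import random


def mean_function(x):
    """Hypothetical non-linear mean m(x); any smooth function would do."""
    return 10.0 * math.sin(math.pi * x)


def exp_cov(coords, sigma2=1.0, phi=3.0):
    """Exponential covariance: sigma2 * exp(-phi * |s_i - s_j|)."""
    return [[sigma2 * math.exp(-phi * abs(si - sj)) for sj in coords]
            for si in coords]


def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def gls_loss(residuals, low):
    """Quadratic form r^T Sigma^{-1} r, computed by solving L w = r."""
    w = []
    for i, r in enumerate(residuals):
        s = sum(low[i][k] * w[k] for k in range(i))
        w.append((r - s) / low[i][i])
    return sum(v * v for v in w)


def simulate(n, seed=0, sigma2=1.0, phi=3.0):
    """Draw (x, y, eps) from y_i = m(x_i) + eps_i on a regular 1-D grid."""
    rng = random.Random(seed)
    coords = [i / (n - 1) for i in range(n)]       # locations or time points
    x = [rng.random() for _ in range(n)]           # one covariate per point
    low = cholesky(exp_cov(coords, sigma2, phi))
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]    # independent N(0, 1) draws
    # Multiplying by L turns independent draws into correlated errors.
    eps = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
    y = [mean_function(xi) + e for xi, e in zip(x, eps)]
    return x, y, eps
```

Calling `simulate(50, seed=1)` returns covariates, responses, and the correlated errors. With an identity covariance, `gls_loss` reduces to the ordinary sum of squared residuals, which is the loss a standard regression tree minimizes. A non-identity covariance reweights the residuals according to their dependence.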