Individual risk prediction: Comparing random forests with Cox proportional-hazards model by a simulation study.

Valia Baralou, Natasa Kalpourtzi, Giota Touloumi
Published in: Biometrical journal. Biometrische Zeitschrift (2022)
With big data becoming widely available in healthcare, machine learning algorithms such as random forest (RF), which ignores time-to-event information, and random survival forest (RSF), which handles right-censored data, are used for individual risk prediction as alternatives to the Cox proportional hazards (Cox-PH) model. We aimed to systematically compare RF and RSF with Cox-PH. RSF with three split criteria [log-rank (RSF-LR), log-rank score (RSF-LRS), maximally selected rank statistics (RSF-MSR)], RF, Cox-PH, and Cox-PH with splines (Cox-S) were evaluated through a simulation study based on real data. One hundred eighty scenarios were investigated, assuming different associations between the predictors and the outcome (linear/linear and interactions/nonlinear/nonlinear and interactions), training sample sizes (500/1000/5000), censoring rates (50%/75%/93%), hazard functions (increasing/decreasing/constant), and numbers of predictors (seven, or 15 including noise variables). The methods' performance was evaluated with the time-dependent area under the curve and the integrated Brier score. In all scenarios, RF had the worst performance. In scenarios with a low number of events (⩽70), Cox-PH was at least noninferior to RSF, whereas under the linearity assumption it outperformed RSF. In the presence of interactions, RSF performed better than Cox-PH as the number of events increased, whereas Cox-S performed at least as well as RSF under nonlinear effects. RSF-LRS performed slightly worse than RSF-LR and RSF-MSR when noise variables and interaction effects were included. When applied to real data, models incorporating survival time performed better. Although RSF algorithms are a promising alternative to conventional Cox-PH as data complexity increases, they require a higher number of events for training. In time-to-event analysis, algorithms that consider survival time should be used.
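To make the "low number of events" threshold concrete, the following minimal Python sketch tabulates the expected number of events for each combination of training sample size and censoring rate quoted in the abstract. It assumes that the censoring rate is the fraction of censored subjects in the training set, so that the expected events are n × (1 − censoring rate). The function names `expected_events` and `event_grid` are hypothetical and chosen for illustration only. Mapping the ⩽70 threshold onto specific scenarios is an inference from this arithmetic, not something the abstract states.

```python
from itertools import product

# Design values quoted in the abstract.
SAMPLE_SIZES = (500, 1000, 5000)
CENSORING_PCT = (50, 75, 93)  # percent of subjects censored


def expected_events(n, censoring_pct):
    """Expected number of observed events (assumption: events = n * (1 - censoring rate)).

    Integer percentages are used to avoid floating-point rounding.
    """
    return n * (100 - censoring_pct) // 100


def event_grid():
    """Map each (sample size, censoring %) pair to its expected event count."""
    return {
        (n, c): expected_events(n, c)
        for n, c in product(SAMPLE_SIZES, CENSORING_PCT)
    }


if __name__ == "__main__":
    for (n, c), events in event_grid().items():
        flag = "  <- low-event (<= 70)" if events <= 70 else ""
        print(f"n={n:>4}  censoring={c}%  expected events={events}{flag}")
```

Under this assumption, only the 93% censoring rate combined with training sizes of 500 or 1000 yields 70 or fewer expected events, which is consistent with the threshold mentioned in the abstract.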
Keyphrases
  • machine learning
  • big data
  • climate change
  • healthcare
  • artificial intelligence
  • electronic health record
  • deep learning
  • data analysis
  • health insurance
  • virtual reality