Sample-weighted semiparametric estimation of cause-specific cumulative risk and incidence using left- or interval-censored data from electronic health records.

Noorie Hyun, Hormuzd A Katki, Barry I Graubard

Published in: Statistics in medicine (2020)

Electronic health records (EHRs) can be a cost-effective data source for forming cohorts and developing risk models in the context of disease screening. However, several issues must be addressed: competing outcomes, left-censoring of prevalent disease, interval-censoring of incident disease, and uncertainty about prevalent disease when disease is not accurately ascertained at baseline. Furthermore, novel tests that are costly and limited in availability can be conducted on stored biospecimens sampled from EHRs using different sampling fractions. We extend sample-weighted semiparametric marginal mixture models to accommodate competing risks. For flexible modeling of relative risks, a general transformation of the subdistribution hazard function and regression parameters is used. We propose a numerical algorithm for nonparametrically calculating the maximum likelihood estimates of the subdistribution hazard functions and regression parameters. We also present methods for calculating consistent confidence intervals for relative and absolute risk estimates. In simulation studies, the proposed algorithm and methods show reliable finite-sample performance. We apply our methods to a cohort assembled from EHRs at a health maintenance organization, where we estimate the cumulative risk of cervical precancer/cancer and the incidence of infection clearance by HPV genotype among human papillomavirus (HPV) positive women. There is no significant difference in 3-year HPV-clearance rates across HPV types, but the 3-year cumulative risk of progression to precancer/cancer is higher for HPV-16 than for the other HPV genotypes.
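To make the censoring and weighting ideas concrete, the following is a minimal sketch of a sample-weighted log-likelihood for left-, interval-, and right-censored observations. It is not the semiparametric estimator of the paper: it assumes a single outcome (no competing risks), swaps in a toy parametric cumulative risk with a point mass at time zero for prevalent disease plus exponential incidence, and takes each weight to be the inverse of a known sampling fraction. The names `cumulative_risk`, `weighted_loglik`, `prevalence`, and `rate` are hypothetical and chosen for illustration only.

```python
import math


def cumulative_risk(t, prevalence, rate):
    """Toy cumulative risk at time t (an assumption, not the paper's model):
    a point mass `prevalence` at t = 0 for prevalent disease, plus
    exponential incidence with hazard `rate` among those disease-free at baseline."""
    return prevalence + (1.0 - prevalence) * (1.0 - math.exp(-rate * t))


def weighted_loglik(records, prevalence, rate):
    """Sample-weighted log-likelihood for censored observations.

    Each record is (left, right, weight):
      left is None   -> left-censored: disease present by `right`
                        (may have been prevalent at baseline)
      right is None  -> right-censored: no disease observed by `left`
      otherwise      -> interval-censored: disease arose in (left, right]
    `weight` is the sampling weight, e.g. 1 / sampling fraction.
    """
    total = 0.0
    for left, right, weight in records:
        if left is None:
            prob = cumulative_risk(right, prevalence, rate)
        elif right is None:
            prob = 1.0 - cumulative_risk(left, prevalence, rate)
        else:
            prob = (cumulative_risk(right, prevalence, rate)
                    - cumulative_risk(left, prevalence, rate))
        total += weight * math.log(prob)
    return total


# Hypothetical records: times in years, weights as inverse sampling fractions.
example_records = [
    (None, 1.0, 2.0),   # disease found at first visit, sampled with fraction 1/2
    (1.0, 3.0, 1.0),    # disease arose between the year-1 and year-3 visits
    (3.0, None, 4.0),   # still disease-free at year 3, sampled with fraction 1/4
]
example_value = weighted_loglik(example_records, prevalence=0.1, rate=0.05)
```

Maximizing such a weighted log-likelihood over the parameters gives a pseudo-likelihood estimate under the toy model; the approach described in the abstract instead leaves the subdistribution hazard functions unspecified and handles competing outcomes.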
