Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data.
Máté E Maros, David Capper, David T W Jones, Volker Hovestadt, Andreas von Deimling, Stefan M Pfister, Axel Benner, Manuela Zucknick, Martin Sill
Published in: Nature Protocols (2020)
DNA methylation data-based precision cancer diagnostics is emerging as the state of the art for molecular tumor classification. Standards for choosing statistical methods with regard to well-calibrated probability estimates for these typically highly multiclass classification tasks are still lacking. To support this choice, we evaluated well-established machine learning (ML) classifiers including random forests (RFs), elastic net (ELNET), support vector machines (SVMs) and boosted trees in combination with post-processing algorithms and developed ML workflows that allow for unbiased class probability (CP) estimation. Calibrators included ridge-penalized multinomial logistic regression (MR) and Platt scaling by fitting logistic regression (LR) and Firth's penalized LR. We compared these workflows on a recently published brain tumor 450k DNA methylation cohort of 2,801 samples with 91 diagnostic categories using a 5 × 5-fold nested cross-validation scheme and demonstrated their generalizability on external data from The Cancer Genome Atlas. ELNET was the top stand-alone classifier with the best calibration profiles. The best overall two-stage workflow was MR-calibrated SVM with linear kernels closely followed by ridge-calibrated tuned RF. For calibration, MR was the most effective regardless of the primary classifier. The protocols developed as a result of these comparisons provide valuable guidance on choosing ML workflows and their tuning to generate well-calibrated CP estimates for precision diagnostics using DNA methylation data. Computation times vary depending on the ML algorithm from <15 min to 5 d using multi-core desktop PCs. Detailed scripts in the open-source R language are freely available on GitHub, targeting users with intermediate experience in bioinformatics and statistics and using R with Bioconductor extensions.
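The calibration stage described above can be illustrated with a minimal sketch of Platt scaling, in which a logistic curve is fitted to raw classifier scores so that they behave like probabilities. This is not the authors' R code: the toy scores and labels, the function names, and the plain gradient-descent fitter (used here in place of a standard logistic regression routine) are all assumptions made for illustration. It covers only the binary case, whereas the abstract concerns 91 classes. For brevity the calibrator is fitted and evaluated on the same toy data; in a nested cross-validation scheme it would be fitted on held-out predictions instead.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_loss(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of binary labels under probs."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def apply_platt(scores, a, b):
    """Map raw scores to calibrated probabilities."""
    return [sigmoid(a * s + b) for s in scores]

# Hypothetical raw scores (e.g. SVM decision values) and true binary labels.
scores = [-2.0, -1.5, -1.0, -0.5, -0.2, 0.1, 0.3, 0.8, 1.2, 2.0]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
calibrated = apply_platt(scores, a, b)
uncalibrated = apply_platt(scores, 1.0, 0.0)  # scores pushed through an unfitted sigmoid

print("slope a = %.3f, intercept b = %.3f" % (a, b))
print("log loss uncalibrated vs calibrated:",
      log_loss(uncalibrated, labels), log_loss(calibrated, labels))
print("Brier score uncalibrated vs calibrated:",
      brier_score(uncalibrated, labels), brier_score(calibrated, labels))
```

Log loss and the Brier score are two common ways to quantify how well probability estimates are calibrated. A multiclass version could fit one such curve per class and renormalize, or replace the curve with a penalized multinomial regression on the full score vector, which is closer in spirit to the MR calibrator named in the abstract.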
Keyphrases
- machine learning
- dna methylation