mitoDataclean: A machine learning approach for the accurate identification of cross-contamination-derived tumor mitochondrial DNA mutations.

Liping Su, Shanshan Guo, Wenjie Guo, Xiaoying Ji, Yang Liu, Huanqin Zhang, Qichao Huang, Kaixiang Zhou, Xu Guo, Xiwen Gu, Jinliang Xing
Published in: International journal of cancer (2022)
Next-generation sequencing (NGS) of mitochondrial DNA (mtDNA) has widespread applications in aging and cancer studies. However, cross-contamination of mtDNA constitutes a major concern. Previous methods for the detection of mtDNA contamination mainly focus on haplogroup-level phylogeny, but neglect haplotype-level differences, leading to limited sensitivity and accuracy. In our study, we present mitoDataclean, a random-forest-based machine learning package for accurate identification of cross-contamination, evaluation of contamination levels and detection of contamination-derived variants in mtDNA NGS data. Comprehensive optimization of mitoDataclean revealed that training simulation with mixtures of small haplogroup distance and low polymorphic difference was critical for optimal modeling. Compared to existing methods, mitoDataclean exhibited significantly improved sensitivity and accuracy for the detection of sample contamination in simulated data. In addition, mitoDataclean achieved area under the curve values of 0.91 and 0.97 for discerning genuine and contamination-derived mtDNA variants in a simulated Western dataset and private sequencing contamination data, respectively, suggesting that this tool may be applicable for different populations and samples with different sources of contamination. Finally, mitoDataclean was further evaluated in several private and public datasets and showed a robust ability for contamination detection. Altogether, our study demonstrates that mitoDataclean may be used for accurate detection of contaminated samples and contamination-derived variants in mtDNA NGS data.
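
The abstract notes that training on simulated mixtures of closely related samples was critical. As a rough illustration of why such mixtures are hard to detect, the sketch below computes the expected variant allele fraction at each site when a major sample is mixed with a contaminant. This is not the mitoDataclean implementation: the function name `mix_allele_fractions`, the variant labels, and the assumptions that both samples are homoplasmic and sequencing is error-free are all hypothetical simplifications.

```python
def mix_allele_fractions(major_variants, contaminant_variants, contamination_level):
    """Expected variant allele fraction (VAF) per variant in a two-sample mixture.

    Assumes (hypothetically) that both samples carry each of their variants
    at 100% and that there is no sequencing error.
    """
    if not 0.0 <= contamination_level <= 1.0:
        raise ValueError("contamination_level must be between 0 and 1")
    expected = {}
    for variant in major_variants | contaminant_variants:
        vaf = 0.0
        if variant in major_variants:
            # The major sample contributes the remaining share of reads.
            vaf += 1.0 - contamination_level
        if variant in contaminant_variants:
            # The contaminant contributes its share of reads.
            vaf += contamination_level
        expected[variant] = vaf
    return expected


# Illustrative placeholder haplotypes, not data from the study.
major = {"73G", "263G", "16519C"}
contaminant = {"73G", "150T", "16223T"}

mixture = mix_allele_fractions(major, contaminant, contamination_level=0.05)
for variant in sorted(mixture):
    print(variant, round(mixture[variant], 3))
```

Under these assumptions, only the sites where the two samples differ carry any signal: contaminant-only variants appear at the contamination level, and shared variants stay at 100%. The fewer sites that differ, the weaker the signal, which is consistent with the abstract's emphasis on mixtures with small haplogroup distance and low polymorphic difference.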