Determining Informative Microbial Single Nucleotide Polymorphisms for Human Identification.
Allison J Sherier, August E Woerner, Bruce Budowle. Published in: Applied and Environmental Microbiology (2022)
The skin microbiome is a highly abundant and relatively stable source of DNA that may be utilized for human identification (HID). In this study, a set of single nucleotide polymorphisms (SNPs) with a high mean estimated Wright's fixation index (F_ST > 0.1) and widespread abundance (found in ≥75% of samples compared) was selected from a diverse set of markers in the hidSkinPlex panel. The least absolute shrinkage and selection operator (LASSO) was used in a novel machine learning framework to generate a SNP panel and predict the human host from skin microbiome samples collected from the hand, manubrium, and foot. The framework was devised to emulate a new, unknown person being introduced to the algorithm and to match samples from that person against a population database. Unknown samples in the test data set (n = 225 samples) were classified with 96% accuracy (Matthews correlation coefficient [MCC] = 0.954). A final panel of informative SNPs for HID (hidSkinPlex+) was determined using all 51 individuals, each sampled at three body sites in triplicate. The hidSkinPlex+ panel comprises 365 SNPs and predicted the correct host with 95% accuracy (MCC = 0.949). The accuracy of the hidSkinPlex+ panel may be somewhat overestimated because 26 individuals from the training data set were used in selecting the final panel. However, this accuracy still provides an indication of performance when tested on new samples.
IMPORTANCE One of the fundamental goals in forensic genetics is to identify the source of biological evidence. Methods for detecting human DNA have advanced and can be quite sensitive, but not all DNA samples are amenable to current methods. However, the human skin microbiome is a source of DNA with high copy numbers, and it has the potential for high discriminatory power. The hidSkinPlex panel has been used for HID; however, some aspects of it could be improved. Missing information is ambiguous, as it is unclear whether marker drop-out is a by-product of a low-template sample or whether the reasons for not observing a marker are biological. Such ambiguity may confound methods for HID. An improved marker set (hidSkinPlex+) was therefore designed that is considerably smaller and more robust to drop-out (365 SNPs contained in 135 markers) yet can still be used to accurately predict the human host.
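The two numerical criteria quoted above, the SNP pre-filter (mean F_ST > 0.1 and presence in ≥75% of samples) and the Matthews correlation coefficient used to report performance, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the dictionary inputs, and the SNP identifiers are hypothetical, and the multiclass form of the MCC is an assumption, since the abstract does not say which formulation was used. The LASSO step itself is not shown.

```python
from collections import Counter
from math import sqrt


def select_snps(mean_fst, presence, fst_min=0.1, presence_min=0.75):
    """Return SNP ids with mean F_ST > fst_min that are observed in at
    least presence_min of the samples compared.

    mean_fst and presence are hypothetical dicts keyed by SNP id; the
    default thresholds are the ones stated in the abstract.
    """
    return sorted(
        snp
        for snp, fst in mean_fst.items()
        if fst > fst_min and presence.get(snp, 0.0) >= presence_min
    )


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient.

    Assumption: the standard multiclass generalisation is used here; it
    reduces to the familiar binary MCC when there are two classes.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # samples per true host
    pred_counts = Counter(y_pred)   # samples per predicted host
    labels = set(true_counts) | set(pred_counts)
    numerator = correct * n - sum(true_counts[k] * pred_counts[k] for k in labels)
    var_true = n * n - sum(true_counts[k] ** 2 for k in labels)
    var_pred = n * n - sum(pred_counts[k] ** 2 for k in labels)
    denominator = sqrt(var_true * var_pred)
    # By convention, return 0.0 when the coefficient is undefined.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Made-up values, for illustration only.
    fst = {"snpA": 0.25, "snpB": 0.05, "snpC": 0.40}
    seen = {"snpA": 0.90, "snpB": 0.95, "snpC": 0.50}
    print(select_snps(fst, seen))
    print(multiclass_mcc(["p1", "p2", "p3"], ["p1", "p2", "p3"]))
```

Unlike raw accuracy, the MCC accounts for how predictions are distributed across classes, which is presumably why the abstract reports it alongside accuracy for a problem with many possible hosts.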