Machine Learning Model for Screening Thyroid Stimulating Hormone Receptor Agonists Based on Updated Datasets and Improved Applicability Domain Metrics.
Wenjia Liu, Zhongyu Wang, Jingwen Chen, Weihao Tang, Haobo Wang
Published in: Chemical Research in Toxicology (2023)
Machine learning (ML) models for screening endocrine-disrupting chemicals (EDCs), such as thyroid stimulating hormone receptor (TSHR) agonists, are essential for sound management of chemicals. Previous models for screening TSHR agonists were built on imbalanced datasets and lacked applicability domain (AD) characterization essential for regulatory application. Herein, an updated TSHR agonist dataset was built, for which the ratio of active to inactive compounds greatly increased to 1:2.6, and chemical spaces of structure-activity landscapes (SALs) were enhanced. Resulting models based on 7 molecular representations and 4 ML algorithms were proven to outperform previous ones. Weighted similarity density (ρ_s) and weighted inconsistency of activities (I_A) were proposed to characterize the SALs, and a state-of-the-art AD characterization methodology AD_SAL{ρ_s, I_A} was established. An optimal classifier developed with PubChem fingerprints and the random forest algorithm, coupled with AD_SAL{ρ_s ≥ 0.15, I_A ≤ 0.65}, exhibited good performance on the validation set with the area under the receiver operating characteristic curve being 0.984 and balanced accuracy being 0.941 and identified 90 TSHR agonist classes that could not be found previously. The classifier together with the AD_SAL{ρ_s, I_A} may serve as efficient tools for screening EDCs, and the AD characterization methodology may be applied to other ML models.
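
As an illustration of how an AD constraint of this form might be applied before scoring a classifier, the following minimal Python sketch filters compounds by the two thresholds reported above and then computes balanced accuracy on the compounds that remain. It assumes ρ_s and I_A have already been computed for each compound; the abstract does not give their formulas, so they are not implemented here. The function names and the toy records are hypothetical and are not data from the paper.

```python
from typing import Sequence

# Thresholds taken from the abstract: AD_SAL{rho_s >= 0.15, I_A <= 0.65}.
RHO_S_MIN = 0.15
I_A_MAX = 0.65


def in_domain(rho_s: float, i_a: float) -> bool:
    """Return True when a compound satisfies both AD thresholds.

    rho_s and i_a are assumed to be precomputed; their definitions
    are not given in the abstract and are not reproduced here.
    """
    return rho_s >= RHO_S_MIN and i_a <= I_A_MAX


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity for binary labels (1 = active)."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical toy records: (rho_s, I_A, true label, predicted label).
records = [
    (0.40, 0.10, 1, 1),  # in domain, correct
    (0.30, 0.20, 0, 0),  # in domain, correct
    (0.05, 0.10, 1, 0),  # out of domain: rho_s too low
    (0.50, 0.90, 0, 1),  # out of domain: I_A too high
    (0.20, 0.60, 1, 0),  # in domain, misclassified
    (0.15, 0.65, 0, 0),  # in domain, exactly on both thresholds
]

# Keep only compounds inside the applicability domain, then score them.
kept = [r for r in records if in_domain(r[0], r[1])]
ba = balanced_accuracy([r[2] for r in kept], [r[3] for r in kept])
print(f"{len(kept)} of {len(records)} compounds in domain, balanced accuracy = {ba:.3f}")
```

Balanced accuracy is used here because it weights the active and inactive classes equally, which matters for a dataset with a 1:2.6 active-to-inactive ratio.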