Boosting the Accuracy and Chemical Space Coverage of the Detection of Small Colloidal Aggregating Molecules Using the BAD Molecule Filter.
Abdallah Abou HajalRichard A BryceBoulbaba Ben AmorNoor AtatrehMohammad A GhattasPublished in: Journal of chemical information and modeling (2024)
The ability to conduct effective high throughput screening (HTS) campaigns in drug discovery is often hampered by the detection of false positives in these assays due to small colloidally aggregating molecules (SCAMs). SCAMs can produce artifactual hits in HTS by nonspecific inhibition of the protein target. In this work, we present a new computational prediction tool for detecting SCAMs based on their 2D chemical structure. The tool, called the boosted aggregation detection (BAD) molecule filter, employs decision tree ensemble methods, namely, the CatBoost classifier and the light gradient-boosting machine, to significantly improve the detection of SCAMs. In developing the filter, we explore models trained on individual data sets, a consensus approach using these models, and, third, a merged data set approach, each tailored for specific drug discovery needs. The individual data set method emerged as most effective, achieving 93% sensitivity and 90% specificity, outperforming existing state-of-the-art models by 20 and 5%, respectively. The consensus models offer broader chemical space coverage, exceeding 90% for all testing sets. This feature is an important aspect particularly for early stage medicinal chemistry projects, and provides information on applicability domain. Meanwhile, the merged data set models demonstrated robust performance, with a notable sensitivity of 79% in the comprehensive 10-fold cross-validation test set. A SHAP analysis of model features indicates the importance of hydrophobicity and molecular complexity as primary factors influencing the aggregation propensity. The BAD molecule filter is readily accessible for the public usage on https://molmodlab-aau.com/Tools.html. This filter provides a new, more robust tool for aggregate prediction in the early stages of drug discovery to optimize hit rates and reduce associated testing and validation overheads.
Keyphrases
- drug discovery
- electronic health record
- loop mediated isothermal amplification
- early stage
- big data
- label free
- real time pcr
- healthcare
- mental health
- machine learning
- emergency department
- quality improvement
- data analysis
- smoking cessation
- lymph node
- high throughput
- radiation therapy
- resistance training
- health information
- body composition
- locally advanced
- protein protein