Regression-Based Active Learning for Accessible Acceleration of Ultra-Large Library Docking.
Egor Marin, Margarita Kovaleva, Maria Kadukova, Khalid Mustafin, Polina A Khorn, Andrey Rogachev, Alexey Mishin, Albert Guskov, Valentin I Borshchevskiy
Published in: Journal of Chemical Information and Modeling (2023)
Structure-based drug discovery is a process for both hit finding and optimization that relies on a validated three-dimensional model of a target biomolecule, used to rationalize the structure-function relationship for this particular target. An ultralarge virtual screening approach has emerged recently for rapid discovery of high-affinity hit compounds, but it requires substantial computational resources. This study shows that active learning with simple linear regression models can accelerate virtual screening, retrieving up to 90% of the top 1% of the docking hit list after docking just 10% of the ligands. The results demonstrate that complex models, such as deep learning approaches, are unnecessary for predicting the imprecise results of ligand docking with a low sampling depth. Furthermore, we explore active learning meta-parameters and find that constant batch size models with a simple ensembling method provide the best ligand retrieval rate. Finally, our approach is validated on an ultralarge virtual screening data set, retrieving 70% of the top 0.05% of ligands after screening only 2% of the library. Altogether, this work provides a computationally accessible approach for accelerated virtual screening that can serve as a blueprint for the future design of low-compute agents for exploration of the chemical space via large-scale accelerated docking. With recent breakthroughs in protein structure prediction, this method can significantly increase accessibility for the academic community and aid in the rapid discovery of high-affinity hit compounds for various targets.
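The active learning loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the library is synthetic (random binary fingerprints with a noisy linear "docking score", lower being better), the function name `active_learning_screen` and its parameters are hypothetical, the model is a single ordinary least-squares fit rather than the ensemble mentioned in the abstract, and batch selection is purely greedy with a constant batch size.

```python
import numpy as np


def active_learning_screen(features, dock, batch_size, n_rounds, seed=0):
    """Greedy active learning with a linear model (illustrative sketch).

    features   : (n_ligands, n_features) array of ligand descriptors
    dock       : callable mapping an index array to docking scores (lower = better)
    batch_size : constant number of ligands docked per round
    n_rounds   : total number of docking rounds, including the random first one
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    # Append a bias column so the linear model has an intercept.
    X = np.hstack([features, np.ones((n, 1))])

    # Round 1: dock a random batch to obtain initial training data.
    docked = rng.choice(n, size=batch_size, replace=False)
    scores = dock(docked)

    for _ in range(n_rounds - 1):
        # Fit ordinary least squares on everything docked so far.
        w, *_ = np.linalg.lstsq(X[docked], scores, rcond=None)
        pred = X @ w
        # Exclude ligands that were already docked from selection.
        pred[docked] = np.inf
        # Greedy acquisition: dock the ligands with the best predicted scores.
        batch = np.argsort(pred)[:batch_size]
        docked = np.concatenate([docked, batch])
        scores = np.concatenate([scores, dock(batch)])

    return docked, scores


# Synthetic stand-in for a docking campaign (assumption: not real docking data).
rng = np.random.default_rng(42)
n_ligands, n_bits = 5000, 32
fingerprints = rng.integers(0, 2, size=(n_ligands, n_bits)).astype(float)
true_scores = fingerprints @ rng.normal(size=n_bits) + rng.normal(
    scale=0.3, size=n_ligands
)

# Dock 5 batches of 100 ligands, i.e. 10% of the toy library.
docked, scores = active_learning_screen(
    fingerprints, lambda idx: true_scores[idx], batch_size=100, n_rounds=5
)

# Fraction of the true top 1% that ended up among the docked ligands.
top = np.argsort(true_scores)[: n_ligands // 100]
recall = np.intersect1d(docked, top).size / top.size
print(f"docked {docked.size / n_ligands:.0%} of library, "
      f"recovered {recall:.0%} of the top 1%")
```

Because the toy scores are linear in the fingerprints by construction, the recall printed here says nothing about real docking data; the sketch only shows the shape of the loop (fit, predict, pick a batch, dock, repeat) and how the retrieval metric is computed.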
Keyphrases
- protein-protein
- small molecule
- molecular dynamics
- molecular dynamics simulations
- drug discovery
- deep learning
- high throughput
- machine learning
- high resolution
- artificial intelligence
- convolutional neural network
- single cell
- data analysis