ALipSol: An Attention-Driven Mixture-of-Experts Model for Lipophilicity and Solubility Prediction.

Jialu Wu, Junmei Wang, Zhenxing Wu, Shengyu Zhang, Yafeng Deng, Yu Kang, Dong-Sheng Cao, Chang-Yu Hsieh, Ting-Jun Hou
Published in: Journal of Chemical Information and Modeling (2022)
Lipophilicity (log D) and aqueous solubility (log Sw) play a central role in drug development. Accurate prediction of these properties remains an unsolved problem because of data scarcity. Current methodologies neglect the intrinsic relationships between physicochemical properties and usually ignore ionization effects. Here, we propose an attention-driven mixture-of-experts (MoE) model named ALipSol, which explicitly reproduces the hierarchy of task relationships. We adopt the principle of divide-and-conquer by breaking down the complex end point (log D or log Sw) into simpler ones (acidic pKa, basic pKa, and log P) and allocating a specific expert network to each subproblem. Subsequently, we implement transfer learning to extract knowledge from related tasks, thus alleviating the problem of limited data. Additionally, we substitute the gating network with an attention mechanism to better capture the dynamic task relationships on a per-example basis. We adopt local fine-tuning and consensus prediction to further boost model performance. Extensive evaluation experiments verify the success of the ALipSol model, which achieves RMSE improvements of 8.04%, 2.49%, 8.57%, 12.8%, and 8.60% on the Lipop, ESOL, AqSolDB, external log D, and external log S data sets, respectively, compared with Attentive FP and the state-of-the-art in silico tools. In particular, our model yields more significant advantages (Welch's t-test) on small training data, implying its high robustness and generalizability. The interpretability analysis shows that the atom contributions learned by ALipSol are more reasonable than those of the vanilla Attentive FP, and that the substitution effects in benzene derivatives agree well with empirical constants, revealing the potential of our model to extract useful patterns from data and provide guidance for lead optimization.
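The idea of replacing a gating network with attention over per-subtask experts can be illustrated with a minimal sketch. This is not the authors' implementation: the function `attention_mix`, the three-dimensional embeddings, and the query vector are all hypothetical stand-ins, and scaled dot-product attention is assumed here only as one common choice. In the paper the experts are trained networks, whereas the vectors below are invented numbers used to show how per-example weights over the acidic pKa, basic pKa, and log P experts could be computed and used to mix their outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_mix(query, expert_embeddings):
    """Weight each expert by scaled dot-product attention against the query.

    Returns a dict of per-expert weights and the weighted sum of the
    expert embeddings. Hypothetical helper, for illustration only.
    """
    names = list(expert_embeddings)
    scale = math.sqrt(len(query))
    scores = [dot(query, expert_embeddings[n]) / scale for n in names]
    weights = softmax(scores)
    mixed = [
        sum(w * expert_embeddings[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), mixed


# Invented embeddings for one molecule, one vector per expert (assumption).
experts = {
    "acidic_pKa": [0.2, -0.1, 0.4],
    "basic_pKa": [0.0, 0.3, -0.2],
    "logP": [0.5, 0.1, 0.1],
}
# Invented task query, e.g. standing in for the log D end point (assumption).
query = [1.0, 0.0, 0.5]

weights, mixed = attention_mix(query, experts)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

Because the weights are recomputed from each molecule's own embeddings, the mixture can differ from example to example, which is the per-example behaviour the abstract attributes to the attention mechanism. A downstream regression head would then map `mixed` to the predicted end point.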