Leveraging Language Model Multitasking To Predict C-H Borylation Selectivity.

Ruslan Kotlyarov, Konstantinos Papachristos, Geoffrey P F Wood, Jonathan M Goodman
Published in: Journal of Chemical Information and Modeling (2024)
C-H borylation is a high-value transformation in the synthesis of lead candidates for the pharmaceutical industry because a wide array of downstream coupling reactions is available. However, predicting its regioselectivity, especially in drug-like molecules that may contain multiple heterocycles, is not a trivial task. Using a data set of borylation reactions from Reaxys, we explored how a language model originally trained on USPTO_500_MT, a broad-scope set of patent data, can be used to predict the C-H borylation reaction product in different modes: product generation and site reactivity classification. Our fine-tuned T5Chem multitask language model can generate the correct product in 79% of cases. It can also classify the reactive aromatic C-H bonds with 95% accuracy and 88% positive predictive value, exceeding purpose-developed graph-based neural networks.
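The two classification metrics quoted above measure different things: accuracy counts every correctly labelled C-H site, reactive or not, while positive predictive value (PPV) asks only how many of the sites predicted as reactive really are. The sketch below illustrates the standard definitions on made-up labels. The function name and the example labels are hypothetical and are not taken from the paper or from T5Chem.

```python
def accuracy_and_ppv(y_true, y_pred):
    """Return (accuracy, PPV) for binary site labels (1 = reactive C-H site)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # reactive, predicted reactive
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # unreactive, predicted unreactive
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # unreactive, predicted reactive
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # reactive, predicted unreactive
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # PPV is undefined when no site is predicted reactive.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return accuracy, ppv


# Hypothetical labels for eight aromatic C-H sites (illustrative only).
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 0]
acc, ppv = accuracy_and_ppv(y_true, y_pred)
print(acc, ppv)
```

Because most aromatic C-H sites in a molecule are unreactive, accuracy can be high even when few reactive sites are found, which is why PPV is reported alongside it.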
Keyphrases
  • neural network
  • autism spectrum disorder
  • machine learning
  • electronic health record
  • big data
  • air pollution
  • data analysis