Interrogating Proteome-wide Cysteine Ligandabilities: Crystallography Meets Chemoproteomics Through Machine Learning.

Ruibin Liu, Joseph Clayton, Mingzhe Shen, Jana Shen

Published in: bioRxiv: the preprint server for biology (2023)

Over the past decade, targeted covalent inhibition (TCI) has become mainstream in drug discovery, and an increasingly large number of cysteine-liganded X-ray structures have been deposited in the Protein Data Bank (PDB). At the same time, a chemoproteomic technique called activity-based protein profiling (ABPP) has ushered in efforts to map covalently ligandable sites across the entire proteome. Here we asked whether the current PDB information is sufficient for developing highly predictive machine-learning (ML) models, and what such models can tell us about the divergence between the cysteine ligandabilities captured by crystallography and those determined by ABPP in cells. Tree-based and convolutional neural network (CNN) models were developed and trained on an exhaustively curated database (LigCys3D) containing over 1,000 liganded cysteines in nearly 800 proteins represented by over 10,000 X-ray structures. In unseen tests, the tree models and CNNs gave AUCs of about 94%; however, when evaluated on a nonoverlapping ABPP dataset, the models gave significantly lower AUCs, especially when AlphaFold2 structure models were used. Our analysis suggests factors that give rise to the divergence and ways to improve model transferability. Developing ML models as a surrogate for crystallography may further unleash the power of chemoproteomics. Our work represents a first step in the ML-led integration of big genome data, structure models, and chemoproteomic experiments to annotate the human proteome space for next-generation drug discovery.

Keyphrases