S-PLM: Structure-aware Protein Language Model via Contrastive Learning between Sequence and Structure.

Duolin Wang, Usman L. Abbas, Qing Shao, Jin Chen, Dong Xu
Published in: bioRxiv: the preprint server for biology (2023)
Large protein language models (PLMs) have shown great potential to reshape protein research. Trained PLMs encode the amino acid sequence of a protein into a mathematical embedding that can be used for protein design or property prediction. Protein 3D structure is known to play an important role in protein properties and functions. However, most PLMs are trained only on sequence data and lack protein 3D structure information. The absence of this crucial 3D structure information limits the predictive capacity of PLMs in various applications, especially those that depend heavily on 3D structure. We utilize contrastive learning to develop a 3D structure-aware protein language model (S-PLM). The model encodes the sequence and 3D structure of a protein separately and deploys a multi-view contrastive loss function to enable information exchange between the sequence and structure embeddings. Our analysis shows that contrastive learning effectively incorporates 3D structure information into the sequence-based embeddings. This enhances the predictive performance of the sequence-based embeddings in several downstream tasks.
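To make the training signal concrete, below is a minimal sketch (not the authors' code) of a CLIP-style symmetric contrastive loss between per-protein sequence and structure embeddings. The function name, temperature value, batch size, and embedding dimension are illustrative assumptions; the paper specifies S-PLM's actual multi-view loss and encoder architectures.

```python
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(seq_emb: torch.Tensor,
                               struct_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched (sequence, structure)
    embedding pairs; row i of each tensor describes the same protein."""
    seq = F.normalize(seq_emb, dim=-1)        # unit-norm sequence embeddings
    struct = F.normalize(struct_emb, dim=-1)  # unit-norm structure embeddings
    logits = seq @ struct.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(seq.size(0), device=seq.device)
    # Matched pairs sit on the diagonal: pull them together and push
    # mismatched sequence/structure pairs apart, in both directions, so
    # each view's encoder learns from the other.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Illustrative shapes only: e.g., pooled outputs of a sequence encoder and
# a structure encoder, projected into a shared 256-dimensional space.
seq_emb = torch.randn(8, 256, requires_grad=True)
struct_emb = torch.randn(8, 256, requires_grad=True)
loss = multiview_contrastive_loss(seq_emb, struct_emb)
loss.backward()
```

Under this kind of objective, gradients flow through both encoders, which is one plausible way the "information exchange" between sequence and structure embeddings described above can be realized; the paper's multi-view formulation may differ in detail.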