Login / Signup

Examining evolutionary scale modeling-derived different-dimensional embeddings in the antimicrobial peptide classification through a KNIME workflow.

Karla L Martínez-MauricioCésar R García-JacasGreneter Cordoves-Delgado
Published in: Protein science : a publication of the Protein Society (2024)
Molecular features play an important role in different bio-chem-informatics tasks, such as the Quantitative Structure-Activity Relationships (QSAR) modeling. Several pre-trained models have been recently created to be used in downstream tasks, either by fine-tuning a specific model or by extracting features to feed traditional classifiers. In this regard, a new family of Evolutionary Scale Modeling models (termed as ESM-2 models) was recently introduced, demonstrating outstanding results in protein structure prediction benchmarks. Herein, we studied the usefulness of the different-dimensional embeddings derived from the ESM-2 models to classify antimicrobial peptides (AMPs). To this end, we built a KNIME workflow to use the same modeling methodology across experiments in order to guarantee fair analyses. As a result, the 640- and 1280-dimensional embeddings derived from the 30- and 33-layer ESM-2 models, respectively, are the most valuable  since statistically better performances were achieved by the QSAR models built from them. We also fused features of the different ESM-2 models, and it was concluded that the fusion contributes to getting better QSAR models than using features of a single ESM-2 model. Frequency studies revealed that only a portion of the ESM-2 embeddings is valuable for modeling tasks since between 43% and 66% of the features were never used. Comparisons regarding state-of-the-art deep learning (DL) models confirm that when performing methodologically principled studies in the prediction of AMPs, non-DL based QSAR models yield comparable-to-superior performances to DL-based QSAR models. The developed KNIME workflow is available-freely at https://github.com/cicese-biocom/classification-QSAR-bioKom. This workflow can be valuable to avoid unfair comparisons regarding new computational methods, as well as to propose new non-DL based QSAR models.
Keyphrases
  • molecular docking
  • deep learning
  • gene expression
  • genome wide
  • working memory
  • mass spectrometry
  • structure activity relationship
  • high intensity
  • binding protein