Unsupervised Machine Learning Organization of the Functional Dark Proteome of Gram-Negative "Superbugs": Six Protein Clusters Amenable for Distinct Scientific Applications.
Carlos SiciliaAndrés Corral-LugoPawel SmialowskiMichael J McConnellAntonio Javier Martín-GalianoPublished in: ACS omega (2022)
Uncharacterized proteins have been underutilized as targets for the development of novel therapeutics for difficult-to-treat bacterial infections. To facilitate the exploration of these proteins, 2819 predicted, uncharacterized proteins (19.1% of the total) from reference strains of multidrug Acinetobacter baumannii , Klebsiella pneumoniae , and Pseudomonas aeruginosa species were organized using an unsupervised k-means machine learning algorithm. Classification using normalized values for protein length, pI, hydrophobicity, degree of conservation, structural disorder, and %AT of the coding gene rendered six natural clusters. Cluster proteins showed different trends regarding operon membership, expression, presence of unknown function domains, and interactomic relevance. Clusters 2, 4, and 5 were enriched with highly disordered proteins, nonworkable membrane proteins, and likely spurious proteins, respectively. Clusters 1, 3, and 6 showed closer distances to known antigens, antibiotic targets, and virulence factors. Up to 21.8% of proteins in these clusters were structurally covered by modeling, which allowed assessment of druggability and discontinuous B-cell epitopes. Five proteins (4 in Cluster 1) were potential druggable targets for antibiotherapy. Eighteen proteins (11 in Cluster 6) were strong B-cell and T-cell immunogen candidates for vaccine development. Conclusively, we provide a feature-based schema to fractionate the functional dark proteome of critical pathogens for fundamental and biomedical purposes.
Keyphrases
- machine learning
- pseudomonas aeruginosa
- multidrug resistant
- gram negative
- acinetobacter baumannii
- escherichia coli
- klebsiella pneumoniae
- drug resistant
- artificial intelligence
- cystic fibrosis
- immune response
- small molecule
- dendritic cells
- long non coding rna
- risk assessment
- copy number
- biofilm formation
- big data
- genome wide analysis