Clustering protein functional families at large scale with hierarchical approaches.

Nicola Bordin, Harry Scholes, Clemens Rauer, Joel Roca-Martinez, Ian Sillitoe, Christine Orengo
Published in: Protein Science: a publication of the Protein Society (2024)
Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which function is conserved across members. The increasing volume and complexity of protein data in large-scale repositories such as MGnify and the AlphaFold Database require more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms that build upon and address limitations of GeMMA/FunFHMMER, our original methods for classifying proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, supporting future studies of protein function and evolution.
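The idea of forming a relationship tree from a distance matrix can be illustrated with a minimal sketch. This is not the CATH-eMMA implementation: the abstract does not specify the distance measure or the linkage rule, so cosine distance and average linkage (UPGMA) are assumptions made here for illustration. The domain names, the toy two-dimensional "embeddings", and the functions `cosine_distance` and `build_tree` are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def build_tree(names, dist):
    """Average-linkage agglomerative clustering (an assumed linkage rule).

    names: list of leaf labels
    dist:  dict mapping frozenset({a, b}) -> distance for every leaf pair
    Returns the tree as nested tuples plus the merge list (left, right, height).
    """
    # Map each active cluster (a leaf name or a nested tuple) to its leaves.
    clusters = {name: [name] for name in names}
    merges = []

    def average_distance(a, b):
        # Mean pairwise distance between the leaves of clusters a and b.
        total = sum(dist[frozenset((x, y))]
                    for x in clusters[a] for y in clusters[b])
        return total / (len(clusters[a]) * len(clusters[b]))

    while len(clusters) > 1:
        # Merge the closest pair of active clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(*pair))
        height = average_distance(a, b)
        members = clusters.pop(a) + clusters.pop(b)
        clusters[(a, b)] = members
        merges.append((a, b, height))

    (root,) = clusters
    return root, merges


# Toy embeddings for four hypothetical domains (illustrative values only).
embeddings = {
    "dom_A": [1.0, 0.0],
    "dom_B": [0.9, 0.1],
    "dom_C": [0.0, 1.0],
    "dom_D": [0.2, 0.8],
}
names = list(embeddings)
dist = {
    frozenset((a, b)): cosine_distance(embeddings[a], embeddings[b])
    for a, b in combinations(names, 2)
}
root, merges = build_tree(names, dist)
print(root)
```

In this toy input the two pairs of similar vectors are joined first and the resulting subtrees are merged last. The same `build_tree` function would accept any precomputed pairwise distances, for example ones derived from structure comparison, since it only reads the `dist` mapping. The pairwise search here is far too slow for large datasets and is meant only to show the shape of the computation.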
Keyphrases
  • protein protein
  • amino acid
  • binding protein
  • machine learning
  • single cell
  • big data
  • transcription factor
  • artificial intelligence
  • case control
  • genetic diversity