EvoAI enables extreme compression and reconstruction of the protein sequence space.

Shuyi Zhang, Ziyuan Ma, Wenjie Li, Yunhao Shen, Yunxin Xu, Gengjiang Liu, Jiamin Chang, Zeju Li, Hong Qin, Boxue Tian, Haipeng Gong, David R Liu, B Thuronyi, Christopher A Voigt
Published in: Research Square (2024)
Designing proteins with improved functions requires a deep understanding of how sequence and function are related, a vast space that is hard to explore. The ability to efficiently compress this space by identifying functionally important features is extremely valuable. Here, we first establish a method called EvoScan to comprehensively segment and scan the high-fitness sequence space to obtain anchor points that capture its essential features, especially in high dimensions. Our approach is compatible with any biomolecular function that can be coupled to a transcriptional output. We then develop deep learning and large language models to accurately reconstruct the space from these anchors, allowing computational prediction of novel, highly fit sequences without prior homology-derived or structural information. We apply this hybrid experimental-computational method, which we call EvoAI, to a repressor protein and find that only 82 anchors are sufficient to compress the high-fitness sequence space with a compression ratio of 10^48. The extreme compressibility of the space informs both applied biomolecular design and understanding of natural evolution.