Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model.

JingJing Zhai, Aaron Gokaslan, Yair Schiff, Ana Berthel, Zong-Yan Liu, Wei-Yun Lai, Zachary R Miller, Armin Scheben, Michelle C Stitzer, Maria Cinta Romay, Edward S Buckler, Volodymyr Kuleshov
Published in: bioRxiv : the preprint server for biology (2024)
Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and, when fine-tuned on limited labeled data, offer better cross-species prediction than supervised models. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a curated dataset of 16 angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks (predicting translation initiation sites, translation termination sites, splice donor sites, and splice acceptor sites) demonstrated high transferability to maize, which diverged from Arabidopsis 160 million years ago, outperforming the best existing DNA LM by 1.45- to 7.23-fold. PlantCaduceus is competitive with state-of-the-art protein LMs at identifying deleterious mutations and is threefold better than PhyloP. Additionally, PlantCaduceus successfully identifies well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.