Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model.
JingJing Zhai, Aaron Gokaslan, Yair Schiff, Ana Berthel, Zong-Yan Liu, Wei-Yun Lai, Zachary R Miller, Armin Scheben, Michelle C Stitzer, Maria Cinta Romay, Edward S Buckler, Volodymyr Kuleshov
Published in: bioRxiv : the preprint server for biology (2024)
Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and, when fine-tuned on limited labeled data, offer better cross-species prediction than supervised models. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a curated dataset of 16 angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks (predicting translation initiation sites, translation termination sites, splice donor sites, and splice acceptor sites) demonstrated high transferability to maize, which diverged from Arabidopsis 160 million years ago, outperforming the best existing DNA LM by 1.45- to 7.23-fold. PlantCaduceus is competitive with state-of-the-art protein LMs in identifying deleterious mutations and is threefold better than PhyloP. Additionally, PlantCaduceus successfully identifies well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.
Keyphrases
- circulating tumor
- single molecule
- cell free
- cell wall
- resistance training
- transcription factor
- air pollution
- autism spectrum disorder
- genome wide
- plant growth
- electronic health record
- nucleic acid
- body composition
- big data
- pet imaging
- machine learning
- physical activity
- working memory
- genetic diversity
- dna methylation
- artificial intelligence
- single cell
- high intensity
- deep learning
- quantum dots