Cross-species modeling of plant genomes at single-nucleotide resolution using a pretrained DNA language model
Publication type:
Article
Authors:
Zhai, Jingjing; Gokaslan, Aaron; Schiff, Yair; Berthel, Ana; Liu, Zong-Yan; Lai, Wei-Yun; Miller, Zachary R.; Scheben, Armin; Stitzer, Michelle C.; Romay, M. Cinta; Buckler, Edward S.; Kuleshov, Volodymyr
Affiliations:
Cornell University; Cornell University; Cornell University; Cold Spring Harbor Laboratory; United States Department of Agriculture (USDA)
Journal:
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
ISSN/ISBN:
0027-8424
DOI:
10.1073/pnas.2421738122
Publication date:
2025-06-17
Keywords:
deleterious mutations
Abstract:
Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pretrained on large-scale biological sequences can capture evolutionary conservation and, when fine-tuned on limited labeled data, offer better cross-species prediction than supervised models. We introduce PlantCaduceus, a plant DNA LM that learns evolutionary conservation patterns in 16 angiosperm genomes by modeling both DNA strands simultaneously. When fine-tuned on a small set of labeled Arabidopsis data for tasks such as predicting translation initiation/termination sites and splice donor/acceptor sites, PlantCaduceus demonstrated remarkable transferability to maize, which diverged from Arabidopsis 160 Mya. The model outperformed the best existing DNA LM by 1.45-fold in maize splice donor prediction and 7.23-fold in maize translation initiation site prediction. In variant effect prediction, PlantCaduceus showed performance comparable to state-of-the-art protein LMs. Mutations predicted to be deleterious by PlantCaduceus showed threefold lower average minor allele frequencies than those identified by multiple sequence alignment-based methods. Additionally, PlantCaduceus successfully identifies well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.
Source URL: