Language models reveal a complex sequence basis for adaptive convergent evolution of protein functions
Publication type:
Article
Authors:
Cao, Zhenqiu; Zhang, Hongjiu; Zou, Zhengting
Affiliations:
Chinese Academy of Sciences; Institute of Zoology, CAS; Chinese Academy of Sciences; University of Chinese Academy of Sciences, CAS
Journal:
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
ISSN/ISBN:
0027-8424
DOI:
10.1073/pnas.2418254122
Publication date:
2025-09-30
Keywords:
hearing gene prestin
adaptation
prediction
signatures
bacterial
Abstract:
Convergent evolution, or convergence, refers to repeated, independent emergences of the same trait in two or more lineages of species during evolution, often indicating functional adaptation to specific environmental factors. Many computational methods have been proposed to investigate the genetic basis for organismal functional convergence, as an important way to decode the complex sequence-function map of proteins. These methods mostly focus on the convergence of amino acid states at the level of individual sites in functionally related proteins. However, even without site-level sequence similarity, protein function similarity may also stem from convergence of high-order protein features, which cannot be captured by the conventional methods. To fill this gap, we first derived numerical embeddings from protein sequences using pretrained protein language models (PLMs). In four previously reported cases, we found that functionally convergent proteins have similar embeddings despite no site-level convergence, indicating that PLM embeddings can reflect convergence of high-order protein features. We then designed a pipeline to detect Adaptive Convergence by Embedding of Protein (ACEP). ACEP tests were significant for known and additional candidate genes associated with putative adaptive convergences such as echolocation and crassulacean acid metabolism. Genome-wide application showed that the ACEP framework can effectively enrich such candidates. Relations between convergences of PLM embeddings and specific protein physicochemical features were further examined. In conclusion, PLM embeddings can indicate adaptive convergence of high-order protein features beyond site identities, demonstrating the power of deep learning tools for investigating the complex mapping between molecular sequences and functions.
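The idea of comparing proteins in embedding space rather than site by site can be illustrated with a minimal Python sketch. This is not the ACEP pipeline: the abstract does not say how per-residue embeddings are pooled or compared, so mean pooling and cosine similarity are assumptions here, the function names `mean_pool` and `cosine_similarity` are hypothetical, and the two-dimensional vectors are made-up toy values standing in for real PLM output.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Average per-residue vectors into one fixed-length protein embedding.

    Assumption: mean pooling is one common choice; the source does not
    state which pooling the authors used.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    n = len(residue_embeddings)
    return [sum(row[i] for row in residue_embeddings) / n for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy per-residue embeddings (made-up numbers, not real PLM output).
# The two "convergent" proteins differ in length, so they cannot be
# compared site by site, yet they pool to nearly the same vector.
convergent_a = mean_pool([[1.0, 0.0], [0.8, 0.2]])
convergent_b = mean_pool([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]])
control = mean_pool([[0.0, 1.0], [0.2, 0.8]])

sim_convergent = cosine_similarity(convergent_a, convergent_b)
sim_control = cosine_similarity(convergent_a, control)
print(sim_convergent > sim_control)
```

In a real analysis the per-residue vectors would come from a pretrained PLM, and the similarity would need to be judged against a suitable null expectation (for example, one that accounts for phylogenetic relatedness) before any claim of adaptive convergence.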
Source URL: