Exact and efficient phylodynamic simulation from arbitrarily large populations

Document type:
Article
Authors:
Celentano, Michael; Dewitt, William S.; Prillo, Sebastian; Song, Yun S.
Affiliations:
University of Washington; University of Washington Seattle; University of California System; University of California Berkeley; University of California System; University of California Berkeley
Journal:
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
ISSN/ISBN:
0027-8424
DOI:
10.1073/pnas.2412978122
Publication date:
2025-05-20
Keywords:
seasonal influenza; extinction rates; speciation; diversification; phylogenies; time
Abstract:
Many biological studies involve inferring the evolutionary history of a sample of individuals from a large population and interpreting the reconstructed tree. Such an ascertained tree typically represents only a small part of a comprehensive population tree and is distorted by survivorship and sampling biases. Inferring evolutionary parameters from ascertained trees requires modeling both the underlying population dynamics and the ascertainment process. A crucial component of this phylodynamic modeling involves tree simulation, which is used to benchmark probabilistic inference methods. To simulate an ascertained tree, one must first simulate the full population tree and then prune unobserved lineages. Consequently, the computational cost is determined not by the size of the final simulated tree, but by the size of the population tree in which it is embedded. In most biological scenarios, simulations of the entire population are prohibitively expensive due to computational demands placed on lineages without sampled descendants. Here, we address this challenge by proving that, for any partially ascertained process from a general multitype birth-death-mutation-sampling model, there exists an equivalent process with complete sampling and no death, a property which we leverage to develop a highly efficient algorithm for simulating trees. Our algorithm scales linearly with the size of the final simulated tree and is independent of the population size, enabling simulations from extremely large populations beyond the reach of current methods but essential for various biological applications. We anticipate that this massive speedup will significantly advance the development of novel inference methods that require extensive training data.
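The baseline that the abstract describes (simulate the full population tree, then prune unobserved lineages) can be illustrated with a minimal sketch. This is not the authors' algorithm: it is the naive approach, reduced here to a single-type birth-death process with sampling only at the present. The function names, the node layout, and the parameters `birth`, `death`, `t_max`, and `rho` are all assumptions made for illustration.

```python
import random

def simulate_full_tree(birth, death, t_max, rng):
    """Forward-simulate a single-type birth-death tree from one lineage
    at time 0 until time t_max. Hypothetical helper, for illustration only."""
    nodes = [{"parent": None, "children": [], "end": None, "extant": False}]
    stack = [(0, 0.0)]  # (node index, start time of that lineage)
    while stack:
        idx, start = stack.pop()
        end = start + rng.expovariate(birth + death)  # time of next event
        if end >= t_max:
            # Lineage survives to the present.
            nodes[idx]["end"] = t_max
            nodes[idx]["extant"] = True
        elif rng.random() < birth / (birth + death):
            # Birth event: the lineage splits into two children.
            nodes[idx]["end"] = end
            for _ in range(2):
                nodes.append({"parent": idx, "children": [],
                              "end": None, "extant": False})
                child = len(nodes) - 1
                nodes[idx]["children"].append(child)
                stack.append((child, end))
        else:
            # Death event: the lineage ends with no descendants.
            nodes[idx]["end"] = end
    return nodes

def prune_to_sample(nodes, rho, rng):
    """Sample each extant lineage with probability rho and return the set
    of node indices that have at least one sampled descendant."""
    kept = set()
    for idx, node in enumerate(nodes):
        if node["extant"] and rng.random() < rho:
            j = idx
            # Walk toward the root, stopping at the first already-kept node.
            while j is not None and j not in kept:
                kept.add(j)
                j = nodes[j]["parent"]
    return kept

rng = random.Random(0)
nodes = simulate_full_tree(birth=1.0, death=0.5, t_max=6.0, rng=rng)
kept = prune_to_sample(nodes, rho=0.1, rng=rng)
print("lineages simulated:", len(nodes))
print("lineages with sampled descendants:", len(kept))
```

In this sketch the work done is proportional to `len(nodes)`, the size of the full population tree, while only the lineages in `kept` contribute to the ascertained tree. That gap is the cost the abstract says the proposed method avoids.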
Source URL: