Distribution of runs and longest runs: A new generating function approach

Document type:
Article
Authors:
Kong, Yong
Affiliations:
National University of Singapore
Journal:
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
ISSN/ISBN:
0162-1459
DOI:
10.1198/016214505000001401
Publication date:
2006
Pages:
1253-1263
Keywords:
multiple runs sample sequences alphabet tests
Abstract:
Exact distributions of run statistics are traditionally obtained by combinatorial methods, which in certain situations become very tedious. Run distributions of multiple-object systems, although they appear frequently in applications in various fields such as computational biology, are not commonly used, due in part to the lack of easy-to-use formulas. In this article, a method for evaluating partition functions of lattice models in statistical mechanics is used to develop a systematic approach to studying various run statistics in multiple-object systems. By using generating functions tailored to the situation under study, many new distributions can be obtained in a unified and coherent way. The method makes it possible to manipulate formulas of run statistics with binomial identities to obtain more general, yet simpler, formulas. To illustrate the general method, the distributions of the total number of runs and of the longest runs are investigated. Novel and general explicit formulas are derived for the distribution and moments of the total number of runs, and simple explicit formulas are derived for the distributions of the longest runs. In addition, some classical run statistics are recovered and generalized in the same unified way. As examples of applications to biological sequence analysis, the run statistics developed with the general method are applied to several protein sequences to examine their global and local features.
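
To make the two statistics named in the abstract concrete, the following is a minimal sketch that tabulates the total number of runs and the longest run by brute-force enumeration over all distinct arrangements of a small multi-letter sequence. It is not the generating function method of the article, whose formulas are not reproduced in this record; the function names `run_lengths` and `run_distributions` and the toy letter counts are assumptions made for illustration only.

```python
from collections import Counter
from itertools import groupby, permutations


def run_lengths(seq):
    """Return the lengths of the maximal runs of identical symbols in seq."""
    return [len(list(group)) for _, group in groupby(seq)]


def run_distributions(counts):
    """Enumerate every distinct arrangement of a multiset of letters.

    counts maps each letter to how many times it occurs (hypothetical input
    format). Returns the number of distinct arrangements, a tally of the
    total number of runs, and a tally of the longest run length.
    Brute force: only feasible for very short sequences.
    """
    letters = "".join(letter * n for letter, n in sorted(counts.items()))
    arrangements = set(permutations(letters))  # set() removes duplicates
    total_runs = Counter()
    longest_run = Counter()
    for arrangement in arrangements:
        lengths = run_lengths(arrangement)
        total_runs[len(lengths)] += 1
        longest_run[max(lengths)] += 1
    return len(arrangements), dict(total_runs), dict(longest_run)


# Toy example: two A's and two B's (6 distinct arrangements).
n, total_runs, longest_run = run_distributions({"A": 2, "B": 2})
print(n)            # number of distinct arrangements
print(total_runs)   # arrangements tallied by total number of runs
print(longest_run)  # arrangements tallied by longest run length
```

Dividing each tally by the number of arrangements gives the exact distribution under the assumption that all arrangements are equally likely. Enumeration of this kind grows factorially with sequence length, which is the tedium that closed-form formulas are meant to avoid.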