GENERALIZED PEARSON-FISHER CHI-SQUARE GOODNESS-OF-FIT TESTS, WITH APPLICATIONS TO MODELS WITH LIFE-HISTORY DATA
Document type:
Article
Authors:
LI, G; DOSS, H
Affiliations:
Purdue University System; Purdue University; State University System of Florida; Florida State University
Journal:
ANNALS OF STATISTICS
ISSN/ISBN:
0090-5364
DOI:
10.1214/aos/1176349151
Publication date:
1993
Pages:
772-797
Keywords:
randomly censored-data
truncated data
large-sample
Abstract:
Suppose that $X_1, \ldots, X_n$ are i.i.d. $\sim F$, and we wish to test the null hypothesis that $F$ is a member of the parametric family $\mathcal{F} = \{F_\theta(x);\ \theta \in \Theta\}$, where $\Theta \subset \mathbb{R}^q$. The classical Pearson-Fisher chi-square test involves partitioning the real axis into $k$ cells $I_1, \ldots, I_k$ and forming the chi-square statistic $X^2 = \sum_{i=1}^{k} (O_i - nF_{\hat\theta}(I_i))^2 / (nF_{\hat\theta}(I_i))$, where $O_i$ is the number of observations falling into cell $i$ and $\hat\theta$ is the value of $\theta$ minimizing $\sum_{i=1}^{k} (O_i - nF_\theta(I_i))^2 / (nF_\theta(I_i))$. We obtain a generalization of this test to any situation for which there is available a nonparametric estimator $\hat F$ of $F$ for which $n^{1/2}(\hat F - F) \to_d W$, where $W$ is a continuous zero-mean Gaussian process satisfying a mild regularity condition. We allow the cells to be data dependent. Essentially, we estimate $\theta$ by the value $\hat\theta$ that minimizes a "distance" between the vectors $(\hat F(I_1), \ldots, \hat F(I_k))$ and $(F_\theta(I_1), \ldots, F_\theta(I_k))$, where distance is measured through an arbitrary positive definite quadratic form, and then form a chi-square type test statistic based on the difference between $(\hat F(I_1), \ldots, \hat F(I_k))$ and $(F_{\hat\theta}(I_1), \ldots, F_{\hat\theta}(I_k))$. We prove that this test statistic has asymptotically a chi-square distribution with $k - q - 1$ degrees of freedom, and point out some errors in the literature on chi-square tests in survival analysis. Our procedure is very general and applies to a number of well-known models in survival analysis, such as right censoring and left truncation. We apply our method to deal with questions of model selection in the problem of estimating the distribution of the length of the incubation period of the AIDS virus using the CDC's data on blood-transfusion related AIDS. Our analysis suggests some models that seem to fit better than those used in the literature.
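
Illustrative note: the classical Pearson-Fisher statistic described in the abstract can be sketched in a few lines of Python. This is only a minimal sketch of the classical, uncensored case, not the generalized procedure of the paper. The exponential family (so q = 1), the simulated data, the cell boundaries, the search interval, and all function names are assumptions made for illustration. The minimizer is a plain golden-section search, which assumes the chi-square criterion is unimodal in theta on the search interval.

```python
import math
import random


def cell_probs(theta, edges):
    """Cell probabilities under an assumed Exponential(rate=theta) model."""
    cdf = [1.0 - math.exp(-theta * e) for e in edges]  # exp(-inf) == 0.0
    return [cdf[i + 1] - cdf[i] for i in range(len(edges) - 1)]


def chi_square(theta, counts, edges):
    """Sum over cells of (O_i - n p_i(theta))^2 / (n p_i(theta))."""
    n = sum(counts)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(counts, cell_probs(theta, edges)))


def golden_section_min(f, lo, hi, tol=1e-8):
    """Minimize f on [lo, hi], assuming f is unimodal there."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0


def count_cells(xs, edges):
    """Observed counts O_i for cells [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in xs:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return counts


# Simulated data (an assumption; the paper analyzes real survival data).
random.seed(0)
sample = [random.expovariate(1.5) for _ in range(500)]

edges = [0.0, 0.25, 0.5, 1.0, 2.0, math.inf]  # k = 5 fixed cells
counts = count_cells(sample, edges)

# Minimum chi-square estimate of theta, then the test statistic.
theta_hat = golden_section_min(lambda t: chi_square(t, counts, edges), 0.05, 20.0)
stat = chi_square(theta_hat, counts, edges)

k, q = len(counts), 1
df = k - q - 1                 # degrees of freedom of the limiting chi-square
critical_5pct = 7.815          # 0.95 quantile of chi-square with 3 df
reject = stat > critical_5pct
print(counts, theta_hat, stat, df, reject)
```

The statistic is compared with the chi-square distribution on k - q - 1 = 3 degrees of freedom. The generalized test in the paper would instead replace the empirical cell frequencies with those of a nonparametric estimator (for example, one suited to censored or truncated data) and use a quadratic-form distance.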