Nonparametric Variable Selection: The EARTH Algorithm
Document type:
Article
Authors:
Doksum, Kjell; Tang, Shijie; Tsui, Kam-Wah
Affiliations:
University of Wisconsin System; University of Wisconsin Madison; Bristol-Myers Squibb
Journal:
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
ISSN/ISBN:
0162-1459
DOI:
10.1198/016214508000000878
Publication date:
2008
Pages:
1609-1620
Keywords:
GOODNESS-OF-FIT
Regression trees
tests
Abstract:
We consider regression experiments involving a response variable Y and a large number of predictor variables X_1, ..., X_d, many of which may be irrelevant for the prediction of Y and thus must be removed before Y can be predicted from the X's. We consider two procedures that select variables by using importance scores that measure the strength of the relationship between predictor variables and a response and keep those variables whose importance scores exceed a threshold. In the first of these procedures, scores are obtained by randomly drawing subregions (tubes) of the predictor space that constrain all but one predictor and, in each subregion, computing a signal-to-noise ratio (efficacy) based on a nonparametric univariate regression of Y on the unconstrained variable. The subregions are adapted to boost weak variables iteratively by searching (hunting) for the subregions in which the efficacy is maximized. The efficacy can be viewed as an approximation to a one-to-one function of the probability of identifying features. By using importance scores based on averages of maximized efficacies, we develop a variable selection algorithm called EARTH (efficacy adaptive regression tube hunting) based on examining the conditional expectation of the response given all but one of the predictor variables for a collection of randomly, adaptively, and iteratively selected regions. The second importance score method (RFVS) is based on using random forest importance values to select variables. Computer simulations show that EARTH and RFVS are successful variable selection methods compared with other procedures in nonparametric situations with a large number of irrelevant predictor variables, and that when each is combined with the model selection and prediction procedure MARS, the tree-based prediction procedure GUIDE, and the random forest method, the combinations lead to improved prediction accuracy for certain models with many irrelevant variables.
We give conditions under which a version of the EARTH algorithm selects the correct model with probability tending to 1 as the sample size n tends to infinity even if d -> infinity as n -> infinity. We include the analysis of a real data set in which we show how a training set can be used to find a threshold for the EARTH importance scores.
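
Illustrative sketch (not part of the published abstract): the tube-based importance score described above can be mimicked in a few lines of standard-library Python. This is a simplified sketch under stated assumptions, not the authors' EARTH implementation. The helper names `binned_snr` and `tube_importance`, the fixed tube `radius`, the binned-mean (regressogram) smoother, and the signal-to-noise formula are all assumptions made for this example; the abstract does not give the exact efficacy definition. The adaptive "hunting" step that maximizes efficacy over tubes is omitted, so scores are plain averages over randomly drawn tubes.

```python
import math
import random
import statistics


def binned_snr(x, y, bins=4):
    """Assumed efficacy stand-in: sd(fitted) / sd(residuals) from binned means of y on x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    size = max(1, len(order) // bins)
    groups = [order[k:k + size] for k in range(0, len(order), size)]
    fitted = [0.0] * len(x)
    for group in groups:
        mean_y = statistics.fmean(y[i] for i in group)
        for i in group:
            fitted[i] = mean_y
    signal = statistics.pvariance(fitted)
    noise = statistics.pvariance([y[i] - fitted[i] for i in range(len(x))])
    return math.sqrt(signal / noise) if noise > 0 else float("inf")


def tube_importance(X, y, j, n_tubes=30, radius=0.3, min_points=20, rng=None):
    """Average the assumed efficacy of predictor j over randomly centered tubes.

    A tube keeps the observations whose other predictors (all k != j) lie
    within `radius` of a randomly chosen data point; predictor j is left free.
    """
    rng = rng or random.Random(0)
    d = len(X[0])
    scores = []
    for _ in range(n_tubes):
        center = X[rng.randrange(len(X))]
        idx = [
            i for i, row in enumerate(X)
            if all(abs(row[k] - center[k]) <= radius for k in range(d) if k != j)
        ]
        if len(idx) >= min_points:  # skip tubes too sparse to smooth
            scores.append(binned_snr([X[i][j] for i in idx], [y[i] for i in idx]))
    return statistics.fmean(scores) if scores else 0.0


# Synthetic example: only the first predictor is relevant.
rng = random.Random(1)
n = 600
X = [[rng.random() for _ in range(3)] for _ in range(n)]
y = [math.sin(2 * math.pi * row[0]) + rng.gauss(0, 0.3) for row in X]
scores = [tube_importance(X, y, j, rng=rng) for j in range(3)]
print(scores)
```

In this toy setting the first predictor should receive a clearly larger score than the two irrelevant ones, and a threshold on the scores would then play the role of the selection rule. With many predictors, fixed-radius tubes quickly become empty, which is one reason an adaptive choice of subregions matters.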