Nonparametric Variable Selection: The EARTH Algorithm

Document type:

Article

Authors:

Doksum, Kjell; Tang, Shijie; Tsui, Kam-Wah

Affiliations:

University of Wisconsin System; University of Wisconsin Madison; Bristol-Myers Squibb

Journal:

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION

ISSN/ISBN:

0162-1459

DOI:

10.1198/016214508000000878

Publication date:

2008

Pages:

1609-1620

Keywords:

Goodness-of-fit tests; Regression trees

Abstract:

We consider regression experiments involving a response variable Y and a large number of predictor variables X_1, ..., X_d, many of which may be irrelevant for the prediction of Y and thus must be removed before Y can be predicted from the X's. We consider two procedures that select variables by using importance scores that measure the strength of the relationship between predictor variables and a response and keep those variables whose importance scores exceed a threshold. In the first of these procedures, scores are obtained by randomly drawing subregions (tubes) of the predictor space that constrain all but one predictor and in each subregion computing a signal-to-noise ratio (efficacy) based on a nonparametric univariate regression of Y on the unconstrained variable. The subregions are adapted to boost weak variables iteratively by searching (hunting) for the subregions in which the efficacy is maximized. The efficacy can be viewed as an approximation to a one-to-one function of the probability of identifying features. By using importance scores based on averages of maximized efficacies, we develop a variable selection algorithm called EARTH (efficacy adaptive regression tube hunting) based on examining the conditional expectation of the response given all but one of the predictor variables for a collection of randomly, adaptively, and iteratively selected regions. The second importance score method (RFVS) is based on using random forest importance values to select variables. Computer simulations show that EARTH and RFVS are successful variable selection methods compared with other procedures in nonparametric situations with a large number of irrelevant predictor variables, and that when each is combined with the model selection and prediction procedure MARS, the tree-based prediction procedure GUIDE, and the random forest method, the combinations lead to improved prediction accuracy for certain models with many irrelevant variables. We give conditions under which a version of the EARTH algorithm selects the correct model with probability tending to 1 as the sample size n tends to infinity even if d → ∞ as n → ∞. We include the analysis of a real data set in which we show how a training set can be used to find a threshold for the EARTH importance scores.
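
Illustrative sketch (not from the paper): the tube-based importance score described in the abstract can be illustrated with a much simplified Python example. This is not the authors' EARTH implementation. It omits the adaptive, iterative hunting step, and several choices are assumptions made only for illustration: tubes are formed from the nearest neighbours of a randomly chosen data point in the constrained predictors, the univariate nonparametric regression is a binned-mean fit, and the function names `tube_efficacy` and `tube_importance`, their parameters, and the simulated model are all hypothetical.

```python
import numpy as np


def tube_efficacy(x, y, n_bins=4):
    """Signal-to-noise ratio of a binned-mean fit of y on x (an assumed
    stand-in for the paper's nonparametric univariate regression)."""
    # Interior quantile edges, so each bin holds roughly the same number of points.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x, side="right")
    fitted = np.empty_like(y, dtype=float)
    for b in np.unique(bins):
        mask = bins == b
        fitted[mask] = y[mask].mean()
    signal = np.sum((fitted - y.mean()) ** 2)      # variation explained by x
    noise = max(np.sum((y - fitted) ** 2), 1e-12)  # residual variation
    return np.sqrt(signal / noise)


def tube_importance(X, y, n_tubes=30, tube_size=60, seed=0):
    """Average efficacy of each predictor over randomly drawn tubes.
    No adaptive hunting is performed; tubes are drawn once at random."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Standardize so distances weight all predictors equally.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    scores = np.zeros(d)
    for j in range(d):
        others = np.delete(np.arange(d), j)  # predictors constrained by the tube
        for _ in range(n_tubes):
            center = Z[rng.integers(n), others]
            dist = np.linalg.norm(Z[:, others] - center, axis=1)
            tube = np.argsort(dist)[:tube_size]  # points closest to the center
            scores[j] += tube_efficacy(X[tube, j], y[tube])
        scores[j] /= n_tubes
    return scores


# Simulated data (hypothetical model): only X_1 and X_2 (columns 0 and 1) matter.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(600, 5))
y = 3 * np.sin(np.pi * X[:, 0]) + 3 * X[:, 1] + 0.5 * rng.normal(size=600)

scores = tube_importance(X, y)
print(np.round(scores, 2))
```

In this toy setting the two relevant predictors should receive visibly larger scores than the three irrelevant ones, and a threshold between the two groups would then select the variables. How that threshold is chosen (the abstract mentions using a training set) is not modelled here.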