IMPUTATION AND POST-SELECTION INFERENCE IN MODELS WITH MISSING DATA: AN APPLICATION TO COLORECTAL CANCER SURVEILLANCE GUIDELINES
Type:
Article
Authors:
Liu, Lin; Qiu, Yuqi; Natarajan, Loki; Messer, Karen
Affiliations:
University of California System; University of California San Diego
Journal:
ANNALS OF APPLIED STATISTICS
ISSN/ISBN:
1932-6157
DOI:
10.1214/19-AOAS1239
Publication date:
2019
Pages:
1370-1396
Keywords:
VARIABLE SELECTION
RISK
US
Abstract:
It is common to encounter missing data among the potential predictor variables in the setting of model selection. For example, in a recent study we attempted to improve the US guidelines for risk stratification after screening colonoscopy (Cancer Causes Control 27 (2016) 1175-1185), with the aim of helping to reduce both overuse and underuse of follow-on surveillance colonoscopy. The goal was to incorporate selected additional informative variables into a neoplasia risk-prediction model, going beyond the three currently established risk factors, using a large dataset pooled from seven different prospective studies in North America. Unfortunately, not all candidate variables were collected in all studies, so that one or more important potential predictors were missing for over half of the subjects. Thus, while variable selection was a main focus of the study, it was necessary to address the substantial amount of missing data. Multiple imputation can effectively address missing data, and there are also good approaches to incorporating the variable selection process into model-based confidence intervals. However, there is no consensus on appropriate methods of inference that address both issues simultaneously. Our goal here is to study the properties of model-based confidence intervals in the setting of imputation for missing data followed by variable selection. We use both simulation and theory to compare three approaches to such post-imputation-selection inference: a multiple-imputation approach based on Rubin's Rules for variance estimation (Comput. Statist. Data Anal. 71 (2014) 758-770); a single imputation-selection followed by bootstrap percentile confidence intervals; and a new bootstrap model-averaging approach presented here, following Efron (J. Amer. Statist. Assoc. 109 (2014) 991-1007). We investigate the relative strengths and weaknesses of each method. The Rubin's Rules multiple imputation estimator can have severe undercoverage, and is not recommended. The imputation-selection estimator with bootstrap percentile confidence intervals works well. The bootstrap-model-averaged estimator, with the Efron's Rules estimated variance, may be preferred if the true effect sizes are moderate. We apply these results to the colorectal neoplasia risk-prediction problem which motivated the present work.
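For readers unfamiliar with the first approach named in the abstract, the standard Rubin's Rules pooling step can be sketched as follows. This is a generic illustration of the textbook formula (total variance = within-imputation variance + (1 + 1/m) × between-imputation variance), not the authors' code; the function name `pool_rubin` and its arguments are hypothetical, and the sketch does not include any variable-selection step.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m completed-data results with Rubin's Rules (generic sketch).

    estimates: point estimates of one coefficient, one per imputed dataset
    variances: the corresponding squared standard errors
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")

    pooled = mean(estimates)       # average of the m point estimates
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical inputs: three imputed datasets, same standard error in each.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

According to the abstract, intervals built from this pooled variance can undercover severely when the imputations are followed by variable selection, which is why the bootstrap-based alternatives are studied.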