High-dimensional semi-supervised learning: in search of optimal inference of the mean

Publication type:
Article
Authors:
Zhang, Yuqian; Bradic, Jelena
Affiliations:
Renmin University of China; University of California System; University of California San Diego
Journal:
BIOMETRIKA
ISSN/ISBN:
0006-3444
DOI:
10.1093/biomet/asab042
Publication date:
2022
Pages:
387-403
Keywords:
regularized calibrated estimation; variable selection; Robust Estimation; Missing Data; Lasso; efficient tests
Abstract:
A fundamental challenge in semi-supervised learning lies in the observed data's disproportional size when compared with the size of the data collected with missing outcomes. An implicit understanding is that the dataset with missing outcomes, being significantly larger, ought to improve estimation and inference. However, it is unclear to what extent this is correct. We illustrate one clear benefit: root-n inference of the outcome's mean is possible while only requiring a consistent estimation of the outcome, possibly at a rate slower than root n. This is achieved by a novel k-fold, cross-fitted, double robust estimator. We discuss both linear and nonlinear outcomes. Such an estimator is particularly suited for models that naturally do not admit root-n consistency, such as high-dimensional, nonparametric or semiparametric models. We apply our methods to estimating heterogeneous treatment effects.
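
Illustrative sketch (not part of the catalogued record): the abstract names a k-fold, cross-fitted estimator of the outcome's mean but does not give its formula. The Python below is a minimal sketch of one generic cross-fitted "imputation plus residual correction" estimator, and it should not be read as the authors' exact construction. It assumes outcomes are missing completely at random, a single covariate, and a simple least-squares working model. The names `fit_line` and `cross_fitted_mean` are hypothetical, as is the choice to average the fold models' predictions over the unlabeled points.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with one covariate.

    Stand-in working model (an assumption); any consistent outcome
    regression could be substituted here.
    """
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def cross_fitted_mean(x_lab, y_lab, x_unlab, n_folds=5):
    """Cross-fitted estimate of E[Y] from labeled and unlabeled data.

    Each labeled point is predicted by a model trained without its fold;
    unlabeled points receive the average prediction of all fold models.
    The labeled residuals then correct the bias of the imputed mean.
    """
    n, m = len(x_lab), len(x_unlab)
    # Interleaved folds; assumes the labeled rows are in random order.
    folds = [list(range(k, n, n_folds)) for k in range(n_folds)]
    pred_sum = 0.0   # sum of predictions over all n + m covariates
    resid_sum = 0.0  # sum of out-of-fold residuals on labeled data
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        model = fit_line([x_lab[i] for i in train],
                         [y_lab[i] for i in train])
        for i in held_out:
            pred = model(x_lab[i])
            pred_sum += pred
            resid_sum += y_lab[i] - pred
        # Each fold model contributes 1/n_folds of the unlabeled predictions.
        pred_sum += sum(model(x) for x in x_unlab) / n_folds
    return pred_sum / (n + m) + resid_sum / n
```

With no unlabeled data the sketch reduces to the labeled sample mean, since each prediction cancels against its own residual. When the working model is exactly right, the estimate equals the model-implied mean over all covariates, which is where the larger unlabeled sample contributes.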