ADAPTIVE ESTIMATION IN MULTIVARIATE RESPONSE REGRESSION WITH HIDDEN VARIABLES
Publication Type:
Article
Authors:
Bing, Xin; Ning, Yang; Xu, Yaosheng
Affiliation:
Cornell University
Journal:
ANNALS OF STATISTICS
ISSN:
0090-5364
DOI:
10.1214/21-AOS2059
Publication Year:
2022
Pages:
640-672
Keywords:
square-root lasso
gene expression
matrix
selection
number
rank
eigenvalue
inference
recovery
Abstract:
A prominent concern of scientific investigators is the presence of unobserved hidden variables in association analysis. Ignoring hidden variables often yields biased statistical results and misleading scientific conclusions. Motivated by this practical issue, this paper studies multivariate response regression with hidden variables, Y = (Ψ*)^T X + (B*)^T Z + E, where Y ∈ ℝ^m is the response vector, X ∈ ℝ^p is the observable feature, Z ∈ ℝ^K is the vector of unobserved hidden variables, possibly correlated with X, and E is an independent error. The number of hidden variables K is unknown, and both m and p are allowed, but not required, to grow with the sample size n. Although Ψ* is shown to be nonidentifiable due to the presence of hidden variables, we propose to identify the projection of Ψ* onto the orthogonal complement of the row space of B*, denoted by Θ*. The quantity (Θ*)^T X measures the effect of X on Y that cannot be explained through the hidden variables, so Θ* is treated as the parameter of interest. Motivated by the identifiability proof, we propose a novel estimation algorithm for Θ*, called HIVE, under homoscedastic errors. The first step of the algorithm estimates the best linear prediction of Y given X, in which the unknown coefficient matrix admits an additive decomposition into Ψ* and a dense matrix arising from the correlation between X and Z. Under a sparsity assumption on Ψ*, we minimize a penalized least squares loss that regularizes Ψ* and the dense matrix via the group lasso and the multivariate ridge penalty, respectively. Nonasymptotic deviation bounds for the in-sample prediction error are established. The second step estimates the row space of B* by leveraging the covariance structure of the residual vector from the first step.
In the last step, we estimate Θ* by projecting Y onto the orthogonal complement of the estimated row space of B* to remove the effect of the hidden variables. Nonasymptotic error bounds for our final estimator of Θ*, valid for any m, p, K and n, are established. We further show that, under mild assumptions, the rate of our estimator matches the best possible rate attainable when B* is known and is adaptive to the unknown sparsity of Θ* induced by the sparsity of Ψ*. The model identifiability, the estimation algorithm and the statistical guarantees are further extended to the setting with heteroscedastic errors. Thorough numerical simulations and two real data examples are provided to support our theoretical results.
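The three-step structure of the procedure described in the abstract can be illustrated with a minimal numerical sketch. This is not the authors' HIVE implementation: for brevity it assumes K is known, uses a plain ridge fit in place of the paper's group-lasso-plus-ridge penalized first step, and takes the leading eigenvectors of the residual covariance as a proxy for the estimated row space of B*.

```python
import numpy as np

def hive_sketch(Y, X, K, lam=1.0):
    """Rough three-step sketch of a HIVE-style estimator.

    Assumptions (not from the paper): K is known, and a single ridge
    penalty `lam` replaces the group-lasso + multivariate-ridge step.
    Y : (n, m) responses, X : (n, p) observed features.
    Returns an estimate of Theta* with shape (p, m).
    """
    n, p = X.shape
    m = Y.shape[1]
    # Step 1: estimate the best linear predictor of Y given X.
    # Its coefficient matrix mixes Psi* with a dense term from cor(X, Z).
    F_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)   # (p, m)
    resid = Y - X @ F_hat
    # Step 2: estimate the row space of B* from the residual covariance;
    # the hidden variables inflate variance along span(rows of B*).
    Sigma_res = resid.T @ resid / n                               # (m, m)
    _, eigvecs = np.linalg.eigh(Sigma_res)                        # ascending
    V = eigvecs[:, -K:]                                           # top-K (m, K)
    # Step 3: project Y onto the orthogonal complement of that space
    # to remove the hidden-variable effect, then regress on X.
    P_perp = np.eye(m) - V @ V.T
    Theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p),
                                X.T @ (Y @ P_perp))               # (p, m)
    return Theta_hat
```

On synthetic data generated from the model Y = XΨ + ZB + E with Z correlated with X, the sketch returns a (p, m) estimate of the projected coefficient matrix; the paper's actual estimator replaces step 1 with the penalized loss and selects K from the data.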