Large-Scale Correlation Screening
Document type:
Article
Authors:
Hero, Alfred; Rajaratnam, Bala
Affiliations:
University of Michigan System; University of Michigan; Stanford University
Journal:
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
ISSN/ISBN:
0162-1459
DOI:
10.1198/jasa.2011.tm11015
Publication date:
2011
Pages:
1540-1552
Keywords:
asymptotic distribution
Wishart distributions
covariance estimation
largest entries
Abstract:
This article addresses the problem of screening for variables with high correlations in high-dimensional data in which there can be many fewer samples than variables. We focus on threshold-based correlation screening methods for three related applications: screening for variables with large correlations within a single treatment (autocorrelation screening), screening for variables with large cross-correlations over two treatments (cross-correlation screening), and screening for variables that have persistently large autocorrelations over two treatments (persistent-correlation screening). The novelty of correlation screening is that it identifies a smaller number of variables that are highly correlated with others compared with identifying a number of correlation parameters. Correlation screening suffers from a phase transition phenomenon; as the correlation threshold decreases, the number of discoveries increases abruptly. We obtain asymptotic expressions for the mean number of discoveries and the phase transition thresholds as a function of the number of samples, the number of variables, and the joint sample distribution. We also show that under a weak dependency condition, the number of discoveries is dominated by a Poisson random variable giving an asymptotic expression for the false-positive rate. The correlation screening approach yields tremendous dividends in terms of the type and strength of the asymptotic results that can be obtained. It also overcomes some of the major hurdles faced by existing methods in the literature, because correlation screening is naturally scalable to high dimensions. Numerical results strongly validate the theory presented here. We illustrate the application of the correlation screening methodology on a large-scale gene-expression dataset, revealing a few influential variables that exhibit significant correlation over multiple treatments. This article has supplementary material online.
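As an informal illustration of the threshold-based screening idea described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that a variable counts as a "discovery" when its largest absolute sample correlation with any other variable reaches a threshold `rho`. The function name `screen_correlations`, the synthetic data, and the chosen thresholds are all hypothetical.

```python
import numpy as np


def screen_correlations(X, rho):
    """Return indices of variables whose maximum absolute sample
    correlation with any other variable is at least rho.

    X has shape (n_samples, n_variables). This screening rule is an
    assumption made for illustration only.
    """
    R = np.corrcoef(X, rowvar=False)   # p x p sample correlation matrix
    np.fill_diagonal(R, 0.0)           # ignore each variable's self-correlation
    max_abs = np.abs(R).max(axis=1)    # strongest correlation per variable
    return np.flatnonzero(max_abs >= rho)


# Synthetic example with many fewer samples than variables (n << p).
rng = np.random.default_rng(0)
n, p = 20, 500
X = rng.standard_normal((n, p))
# Plant one strongly correlated pair (variables 0 and 1).
X[:, 1] = X[:, 0] + 0.05 * rng.standard_normal(n)

# Lowering the threshold can only keep or increase the number of discoveries.
for rho in (0.95, 0.8, 0.7, 0.6):
    print(rho, screen_correlations(X, rho).size)
```

With so few samples, purely spurious correlations become common once the threshold drops, so the printed counts give a rough feel for the abrupt growth in discoveries that the abstract calls a phase transition. The sketch does not compute the paper's asymptotic expressions for the mean number of discoveries or the phase transition threshold.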