INFLUENTIAL FEATURES PCA FOR HIGH DIMENSIONAL CLUSTERING

成果类型:
Article
署名作者:
Jin, Jiashun; Wang, Wanjie
署名单位:
Carnegie Mellon University; National University of Singapore
刊物名称:
ANNALS OF STATISTICS
ISSN/ISSBN:
0090-5364
DOI:
10.1214/15-AOS1423
发表日期:
2016
页码:
2323-2359
关键词:
higher criticism large deviations CLASSIFICATION components inference graph rare
摘要:
We consider a clustering problem where we observe feature vectors X-i is an element of R-P, i = 1, 2,..., n, from K possible classes. The class labels are unknown and the main interest is to estimate them. We are primarily interested in the modern regime of p >> n, where classical clustering methods face challenges. We propose Influential Features PCA (IF-PCA) as a new clustering procedure. In IF-PCA, we select a small fraction of features with the largest Kolmogorov Smirnov (KS) scores, obtain the first (K-1) left singular vectors of the post-selection normalized data matrix, and then estimate the labels by applying the classical k-means procedure to these singular vectors. In this procedure, the only tuning parameter is the threshold in the feature selection step. We set the threshold in a data-driven fashion by adapting the recent notion of Higher Criticism. As a result, IF-PCA is a tuning-free clustering method. We apply IF-PCA to 10 gene microarray data sets. The method has competitive performance in clustering. Especially, in three of the data sets, the error rates of IF-PCA are only 29% or less of the error rates by other methods. We have also rediscovered a phenomenon on empirical null by Efron [J. Amer. Statist. Assoc. 99 (2004) 96-104] on microarray data. With delicate analysis, especially post-selection eigen-analysis, we derive tight probability bounds on the Kolmogorov Smirnov statistics and show that IF-PCA yields clustering consistency in a broad context. The clustering problem is connected to the problems of sparse PCA and low-rank matrix recovery, but it is different in important ways. We reveal an interesting phase transition phenomenon associated with these problems and identify the range of interest for each.