AN lp THEORY OF PCA AND SPECTRAL CLUSTERING

成果类型:
Article
署名作者:
Abbe, Emmanuel; Fan, Jianqing; Wang, Kaizheng
署名单位:
Swiss Federal Institutes of Technology Domain; Ecole Polytechnique Federale de Lausanne; Princeton University; Columbia University; Columbia University
刊物名称:
ANNALS OF STATISTICS
ISSN/ISSBN:
0090-5364
DOI:
10.1214/22-AOS2196
发表日期:
2022
页码:
2359-2385
关键词:
Principal component analysis community detection LARGEST EIGENVALUE incomplete data exact recovery sample-size bounds matrices perturbation asymptotics
摘要:
Principal Component Analysis (PCA) is a powerful tool in statistics and machine learning. While existing study of PCA focuses on the recovery of principal components and their associated eigenvalues, there are few precise characterizations of individual principal component scores that yield low-dimensional embedding of samples. That hinders the analysis of various spectral methods. In this paper, we first develop an l(p) perturbation theory for a hollowed version of PCA in Hilbert spaces which provably improves upon the vanilla PCA in the presence of heteroscedastic noises. Through a novel l(p) analysis of eigenvectors, we investigate entrywise behaviors of principal component score vectors and show that they can be approximated by linear functionals of the Gram matrix in l(p) norm, which includes l(2) and l(infinity) as special cases. For sub-Gaussian mixture models, the choice of p giving optimal bounds depends on the signal-to-noise ratio, which further yields optimality guarantees for spectral clustering. For contextual community detection, the l(p) theory leads to simple spectral algorithms that achieve the information threshold for exact recovery and the optimal misclassification rate.