Dimension Reduction for Large-Scale Federated Data: Statistical Rate and Asymptotic Inference

Publication type:
Article; Early Access
Authors:
Shen, Shuting; Lu, Junwei; Lin, Xihong
Affiliations:
National University of Singapore; Harvard University; Harvard T.H. Chan School of Public Health; Harvard University
Journal:
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
ISSN/ISBN:
0162-1459
DOI:
10.1080/01621459.2025.2537453
Publication date:
2025
Keywords:
largest eigenvalue; principal eigenstructure
Abstract:
In light of the rapidly growing large-scale data in federated ecosystems, the traditional principal component analysis (PCA) is often not applicable due to privacy protection considerations and large computational burden. Algorithms were proposed to lower the computational cost, but few can handle both high dimensionality and massive sample size under distributed settings. In this article, we propose the FAst DIstributed (FADI) PCA method for federated data when both the dimension d and the sample size n are ultra-large, by simultaneously performing parallel computing along d and distributed computing along n. Specifically, we use L parallel copies of p-dimensional fast sketches to divide the computing burden along d and aggregate the results distributively along the split samples. We present a general framework applicable to multiple statistical problems, and establish comprehensive theoretical results under the general framework. We show that FADI accelerates the computation while enjoying the same non-asymptotic error rate as the traditional PCA when Lp >= d. We also derive inferential results that characterize the asymptotic distribution of FADI, and show a phase-transition phenomenon as Lp increases. We perform extensive simulations to empirically validate our theoretical findings, and apply FADI to the 1000 Genomes data to study the population structure. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
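
Illustrative sketch (not part of the original record): the abstract describes L parallel p-dimensional sketches computed along d and aggregated across sample splits along n. The NumPy code below is one hedged reading of that idea, not the authors' implementation. It assumes Gaussian sketching matrices, a sample second-moment matrix as the PCA target, and aggregation by averaging the L projection matrices; none of these details are stated in the abstract. The function name `sketched_distributed_pca` and all of its parameters are hypothetical.

```python
import numpy as np

def sketched_distributed_pca(site_data, K, p, L, seed=0):
    """Estimate the top-K principal subspace from row-split data.

    site_data : list of (n_s, d) arrays, one per site (assumed centered).
    K         : number of principal components to recover.
    p         : sketch dimension (assumed p >= K).
    L         : number of independent sketches.
    """
    rng = np.random.default_rng(seed)
    d = site_data[0].shape[1]
    n = sum(X.shape[0] for X in site_data)

    bases = []
    for _ in range(L):
        # Assumption: a Gaussian test matrix plays the role of the "fast sketch".
        omega = rng.standard_normal((d, p))
        # Each site contributes X_s^T (X_s @ omega); the d x d matrix is never formed.
        # Summing over sites gives (X^T X / n) @ omega for the pooled data.
        Y = sum(X.T @ (X @ omega) for X in site_data) / n
        # Top-K left singular vectors of the d x p sketch.
        U, _, _ = np.linalg.svd(Y, full_matrices=False)
        bases.append(U[:, :K])

    # The left singular vectors of the stacked bases are the eigenvectors of
    # the averaged projection matrix (1/L) * sum_l U_l U_l^T.
    stacked = np.hstack(bases) / np.sqrt(L)
    V, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return V[:, :K]
```

In this sketch the loop over sketches is independent across iterations, so it could run in parallel, and each site only needs to share a d x p matrix rather than its raw rows. The product Lp controls how much of the d-dimensional space the sketches jointly cover, which is the quantity the abstract ties to the error rate and the phase transition.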