Convex clustering via l1 fusion penalization
Type:
Article
Authors:
Radchenko, Peter; Mukherjee, Gourab
Affiliations:
University of Southern California; University of Sydney
Journal:
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
ISSN/ISBN:
1369-7412
DOI:
10.1111/rssb.12223
Publication date:
2017
Pages:
1527-1546
Keywords:
cell mass cytometry
regression shrinkage
feature-selection
Data set
number
mixture
Lasso
Abstract:
We study the large-sample behaviour of a convex clustering framework, which minimizes the sample within-cluster sum of squares under an l1 fusion constraint on the cluster centroids. This recently proposed approach has been gaining in popularity; however, its asymptotic properties have remained mostly unknown. Our analysis is based on a novel representation of the sample clustering procedure as a sequence of cluster splits determined by a sequence of maximization problems. We use this representation to provide a simple and intuitive formulation for the population clustering procedure. We then demonstrate that the sample procedure consistently estimates its population analogue and we derive the corresponding rates of convergence. The proof conducts a careful simultaneous analysis of a collection of M-estimation problems, whose cardinality grows together with the sample size. On the basis of the new perspectives gained from the asymptotic investigation, we propose a key post-processing modification of the original clustering framework. We show, both theoretically and empirically, that the resulting approach can be successfully used to estimate the number of clusters in the population. Using simulated data, we compare the proposed method with existing number-of-clusters and modality assessment approaches and obtain encouraging results. We also demonstrate the applicability of our clustering method to the detection of cellular subpopulations in a single-cell virology study.
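As an illustration of the kind of objective the abstract describes, the following is a minimal univariate sketch, not the authors' implementation. It assumes the penalized (Lagrangian) form with uniform weights, minimizing 0.5 * sum_i (x_i - a_i)^2 + lam * sum_{i<j} |a_i - a_j| over the centroids a_i. Under that assumption the fitted centroids keep the order of the data, so the penalty becomes linear in the sorted centroids and the problem reduces to an isotonic regression of shifted data. The names `pava`, `fused_centroids_1d`, and `lam` are hypothetical and chosen for this example only.

```python
import numpy as np


def pava(y):
    """Least-squares non-decreasing fit via pool-adjacent-violators."""
    means, counts = [], []
    for v in y:
        means.append(float(v))
        counts.append(1)
        # Merge the last two blocks while they violate monotonicity.
        while len(means) > 1 and means[-2] > means[-1]:
            m2, c2 = means.pop(), counts.pop()
            m1, c1 = means.pop(), counts.pop()
            means.append((m1 * c1 + m2 * c2) / (c1 + c2))
            counts.append(c1 + c2)
    return np.repeat(means, counts)


def fused_centroids_1d(x, lam):
    """Centroids for the assumed univariate l1-fusion objective.

    For sorted data, sum_{i<j} |a_i - a_j| equals sum_i (2i - n - 1) a_i
    (1-based i) on the set of non-decreasing a, so the minimizer is the
    isotonic fit of x_sorted - lam * (2i - n - 1).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    order = np.argsort(x)
    coef = 2 * np.arange(1, n + 1) - n - 1
    fitted_sorted = pava(x[order] - lam * coef)
    fitted = np.empty(n)
    fitted[order] = fitted_sorted  # restore the original data order
    return fitted


x = np.array([5.1, 0.0, 5.0, 0.1])
centroids = fused_centroids_1d(x, lam=0.05)
n_clusters = len(np.unique(np.round(centroids, 8)))
```

Observations that share a fitted centroid form one cluster. With `lam = 0` every point keeps its own centroid, and for sufficiently large `lam` all centroids fuse at the sample mean, so varying `lam` traces a sequence of cluster splits or merges in this sketch.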
Source URL: