-
Authors: Gao, Chao; Zhang, Anderson Y.
Affiliations: University of Chicago; University of Pennsylvania
Abstract: We propose a general modeling and algorithmic framework for discrete structure recovery that can be applied to a wide range of problems. Under this framework, we are able to study the recovery of clustering labels, ranks of players, signs of regression coefficients, cyclic shifts and even group elements from a unified perspective. A simple iterative algorithm is proposed for discrete structure recovery, which generalizes methods including Lloyd's algorithm and the power method. A linear conver...
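As a concrete reference point for one of the special cases named in the abstract, the following is a minimal sketch of classical Lloyd's algorithm on one-dimensional data. It is not the paper's general framework; the function name `lloyd_1d`, the toy data, and the starting centers are illustrative assumptions.

```python
def lloyd_1d(points, centers, n_iter=20):
    """Classical Lloyd's algorithm in 1-D (illustrative sketch, not the paper's method)."""
    labels = []
    for _ in range(n_iter):
        # Label update: assign each point to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
            for x in points
        ]
        # Center update: average the points carrying each label.
        new_centers = []
        for k in range(len(centers)):
            members = [x for x, z in zip(points, labels) if z == k]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break  # fixed point reached
        centers = new_centers
    return labels, centers


# Hypothetical toy data: two well-separated groups.
points = [0.1, -0.2, 0.0, 5.1, 4.9, 5.2]
labels, centers = lloyd_1d(points, [0.5, 4.0])
print(labels, centers)
```

The alternation between a discrete update (labels) and a continuous update (centers) is the pattern that the abstract describes as being generalized to other discrete structures.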
-
Authors: Silin, Igor; Fan, Jianqing
Affiliations: Princeton University
Abstract: We consider a high-dimensional linear regression problem. Unlike many papers on the topic, we do not require sparsity of the regression coefficients; instead, our main structural assumption is a decay of eigenvalues of the covariance matrix of the data. We propose a new family of estimators, called the canonical thresholding estimators, which pick largest regression coefficients in the canonical form. The estimators admit an explicit form and can be linked to LASSO and Principal Component Regr...
-
Authors: Bing, Xin; Bunea, Florentina; Strimas-Mackey, Seth; Wegkamp, Marten
Affiliations: University of Toronto; Cornell University; Cornell University; Cornell University
Abstract: This paper studies the estimation of high-dimensional, discrete, possibly sparse, mixture models in the context of topic models. The data consists of observed multinomial counts of p words across n independent documents. In topic models, the p x n expected word frequency matrix is assumed to be factorized as a p x K word-topic matrix A and a K x n topic-document matrix T. Since columns of both matrices represent conditional probabilities belonging to probability simplices, columns of A are vie...
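The factorization described in the abstract can be illustrated with a small numeric sketch. The matrices below are made-up values (p = 3 words, K = 2 topics, n = 2 documents), and the names `Pi`, `matmul`, and `doc_length` are assumptions introduced for illustration only; the sketch shows the data-generating model, not the paper's estimator.

```python
import random

# Hypothetical word-topic matrix A (p x K); each column is a distribution over words.
A = [[0.5, 0.1],
     [0.3, 0.2],
     [0.2, 0.7]]
# Hypothetical topic-document matrix T (K x n); each column is a distribution over topics.
T = [[0.9, 0.4],
     [0.1, 0.6]]


def matmul(X, Y):
    """Plain-Python matrix product of X (a x b) and Y (b x c)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


# Expected word frequency matrix (p x n): Pi = A T.
Pi = matmul(A, T)

# Each column of Pi is again a probability vector over the p words.
column_sums = [sum(Pi[i][j] for i in range(len(Pi))) for j in range(len(Pi[0]))]

# Draw multinomial word counts for document 0 with an assumed length of 100 words.
rng = random.Random(0)
doc_length = 100
draws = rng.choices(range(len(Pi)), weights=[row[0] for row in Pi], k=doc_length)
counts = [draws.count(i) for i in range(len(Pi))]
print(Pi, column_sums, counts)
```

Because the columns of A and T both lie in probability simplices, so do the columns of their product, which is what makes each document's word counts a multinomial sample.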
-
Authors: Wang, Yuhao; Li, Xinran
Affiliations: Tsinghua University; University of Illinois System; University of Illinois Urbana-Champaign
Abstract: Completely randomized experiments have been the gold standard for drawing causal inference because they can balance all potential confounding on average. However, they may suffer from unbalanced covariates for realized treatment assignments. Rerandomization, a design that rerandomizes the treatment assignment until a prespecified covariate balance criterion is met, has recently gained attention due to its easy implementation, improved covariate balance and more efficient inference. Researchers ...
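The rerandomization design described in the abstract can be sketched as an accept/reject loop. This is a minimal sketch under stated assumptions: a single covariate, an absolute difference in group means as the balance criterion, and an arbitrary threshold of 0.1. The function name `rerandomize` and all parameter values are hypothetical, and the paper's own criterion may differ.

```python
import random


def mean(values):
    return sum(values) / len(values)


def rerandomize(covariate, n_treated, threshold, rng, max_draws=10_000):
    """Redraw a completely randomized assignment until the balance criterion is met."""
    n = len(covariate)
    for draw in range(1, max_draws + 1):
        # One completely randomized assignment: n_treated units chosen uniformly.
        treated = set(rng.sample(range(n), n_treated))
        x_treated = [covariate[i] for i in range(n) if i in treated]
        x_control = [covariate[i] for i in range(n) if i not in treated]
        # Assumed balance criterion: absolute difference in covariate means.
        imbalance = abs(mean(x_treated) - mean(x_control))
        if imbalance <= threshold:
            assignment = [1 if i in treated else 0 for i in range(n)]
            return assignment, imbalance, draw
    raise RuntimeError("no acceptable assignment found within max_draws")


rng = random.Random(0)
covariate = [rng.gauss(0.0, 1.0) for _ in range(20)]  # simulated covariate values
assignment, imbalance, draws = rerandomize(covariate, 10, 0.1, rng)
print(assignment, imbalance, draws)
```

A tighter threshold yields better covariate balance but requires more redraws on average, which is the trade-off a rerandomization design has to manage.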
-
Authors: Gao, Fengnan; Wang, Tengyao
Affiliations: Fudan University; University of London; London School Economics & Political Science
Abstract: We introduce a new method for two-sample testing of high-dimensional linear regression coefficients without assuming that those coefficients are individually estimable. The procedure works by first projecting the matrices of covariates and response vectors along directions that are complementary in sign in a subset of the coordinates, a process which we call complementary sketching. The resulting projected covariates and responses are aggregated to form two test statistics, which are shown to ...
-
Authors: Wang, Jingshu; Gui, Lin; Su, Weijie J.; Sabatti, Chiara; Owen, Art B.
Affiliations: University of Chicago; University of Pennsylvania; Stanford University
Abstract: Replicability is a fundamental quality of scientific discoveries: we are interested in those signals that are detectable in different laboratories, different populations, across time, etc. Unlike meta-analysis, which accounts for experimental variability but does not guarantee replicability, testing a partial conjunction (PC) null aims specifically to identify the signals that are discovered in multiple studies. In many contemporary applications, for example, comparing multiple high-throughput g...
-
Authors: Roquain, Etienne; Verzelen, Nicolas
Affiliations: Universite Paris Cite; Centre National de la Recherche Scientifique (CNRS); Universite Paris Cite; Sorbonne Universite; INRAE; Universite de Montpellier; Institut Agro
Abstract: Classical multiple testing theory prescribes the null distribution, which is often too stringent an assumption for today's large-scale experiments. This paper presents theoretical foundations to understand the limitations caused by ignoring the null distribution, and how it can be properly learned from the same data set, when possible. We explore this issue in the setting where the null distributions are Gaussian with unknown rescaling parameters (mean and variance) whereas the alternative di...
-
Authors: Depersin, Jules; Lecue, Guillaume
Affiliations: Institut Polytechnique de Paris; ENSAE Paris
Abstract: We construct an algorithm for estimating the mean of a heavy-tailed random variable when given an adversarially corrupted sample of N independent observations. The only assumption we make on the distribution of the non-corrupted (or informative) data is the existence of a covariance matrix $\Sigma$, unknown to the statistician. Our algorithm outputs $\hat{\mu}$, which is robust to the presence of $|O|$ adversarial outliers and satisfies $\|\hat{\mu} - \mu\|$...
-
Authors: Einmahl, John H. J.; Ferreira, Ana; de Haan, Laurens; Neves, Claudia; Zhou, Chen
Affiliations: Tilburg University; Universidade de Lisboa; Universidade de Lisboa; Erasmus University Rotterdam - Excl Erasmus MC; Erasmus University Rotterdam; University of Reading; Erasmus University Rotterdam; Erasmus University Rotterdam - Excl Erasmus MC
Abstract: The statistical theory of extremes is extended to independent multivariate observations that are non-stationary both over time and across space. The non-stationarity over time and space is controlled via the scedasis (tail scale) in the marginal distributions. Spatial dependence stems from multivariate extreme value theory. We establish asymptotic theory for both the weighted sequential tail empirical process and the weighted tail quantile process based on all observations, taken over time and...
-
Authors: Wong, Kin Yau; Zeng, Donglin; Lin, D. Y.
Affiliations: Hong Kong Polytechnic University; University of North Carolina; University of North Carolina Chapel Hill
Abstract: In long-term follow-up studies, data are often collected on repeated measures of multivariate response variables as well as on time to the occurrence of a certain event. To jointly analyze such longitudinal data and survival time, we propose a general class of semiparametric latent-class models that accommodates a heterogeneous study population with flexible dependence structures between the longitudinal and survival outcomes. We combine nonparametric maximum likelihood estimation with sieve e...