-
Authors: Wu, Xiaoyang; Huo, Yuyang; Ren, Haojie; Zou, Changliang
Affiliations: Nankai University; Shanghai Jiao Tong University
Abstract: In the big data era, subsampling or sub-data selection techniques are often adopted to extract a fraction of informative individuals from the massive data. Existing subsampling algorithms focus mainly on obtaining a representative subset to achieve the best estimation accuracy under a given class of models. In this article, we consider a semi-supervised setting wherein a small or moderate sized labeled data is available in addition to a much larger sized unlabeled data. The goal is to sample f...
-
Authors: Catalano, Marta; Lavenant, Hugo; Lijoi, Antonio; Prunster, Igor
Affiliations: Luiss Guido Carli University; Bocconi University
Abstract: Optimal transport and Wasserstein distances are flourishing in many scientific fields as a means for comparing and connecting random structures. Here we pioneer the use of an optimal transport distance between Lévy measures to solve a statistical problem. Dependent Bayesian nonparametric models provide flexible inference on distinct, yet related, groups of observations. Each component of a vector of random measures models a group of exchangeable observations, while their dependence regulates t...
-
Authors: Ye, Ting; Keele, Luke; Hasegawa, Raiden; Small, Dylan S.
Affiliations: University of Washington; University of Washington Seattle; University of Pennsylvania; Alphabet Inc.; Google Incorporated
Abstract: The method of difference-in-differences (DID) is widely used to study the causal effect of policy interventions in observational studies. DID employs a before and after comparison of the treated and control units to remove bias due to time-invariant unmeasured confounders under the parallel trends assumption. Estimates from DID, however, will be biased if the outcomes for the treated and control units evolve differently in the absence of treatment, namely if the parallel trends assumption is v...
-
Authors: Han, Sukjin
Affiliations: University of Bristol
Abstract: Dynamic treatment regimes are treatment allocations tailored to heterogeneous individuals (e.g., via previous outcomes and covariates). The optimal dynamic treatment regime is a regime that maximizes counterfactual welfare. We introduce a framework in which we can partially learn the optimal dynamic regime from observational data, relaxing the sequential randomization assumption commonly employed in the literature but instead using (binary) instrumental variables. We propose the notion of shar...
-
Authors: Li, Sai; Zhang, Linjun; Cai, T. Tony; Li, Hongzhe
Affiliations: Renmin University of China; Rutgers University System; Rutgers University New Brunswick; University of Pennsylvania
Abstract: Transfer learning provides a powerful tool for incorporating data from related studies into a target study of interest. In epidemiology and medical studies, the classification of a target disease could borrow information across other related diseases and populations. In this work, we consider transfer learning for high-dimensional Generalized Linear Models (GLMs). A novel algorithm, TransHDGLM, that integrates data from the target study and the source studies is proposed. Minimax rate of conve...
-
Authors: Hu, Xiaoyu; Lei, Jing
Affiliations: Peking University; Carnegie Mellon University
Abstract: We consider the problem of testing the equality of conditional distributions of a response variable given a vector of covariates between two populations. Such a hypothesis testing problem can be motivated from various machine learning and statistical inference scenarios, including transfer learning and causal predictive inference. We develop a nonparametric test procedure inspired from the conformal prediction framework. The construction of our test statistic combines recent developments in co...
-
Authors: Teng, Hao Yang; Zhang, Zhengjun
Affiliations: Arkansas State University; University of Wisconsin System; University of Wisconsin Madison
Abstract: This article introduces a new type of linear regression model with regularization. Each predictor is conditionally truncated through the presence of unknown thresholds. The new model, called the two-way truncated linear regression model (TWT-LR), is not only viewed as a nonlinear generalization of a linear model but is also a much more flexible model with greatly enhanced interpretability and applicability. The TWT-LR model performs classifications through thresholds similar to the tree-based ...
-
Authors: Park, Seyoung; Lee, Eun Ryung; Zhao, Hongyu
Affiliations: Sungkyunkwan University (SKKU); Yale University
Abstract: In this article, we study high-dimensional multivariate logistic regression models in which a common set of covariates is used to predict multiple binary outcomes simultaneously. Our work is primarily motivated from many biomedical studies with correlated multiple responses such as the cancer cell-line encyclopedia project. We assume that the underlying regression coefficient matrix is simultaneously low-rank and row-wise sparse. We propose an intuitively appealing selection and estimation fra...
-
Authors: Chen, Elynn Y.; Song, Rui; Jordan, Michael I.
Affiliations: New York University; Amazon.com; University of California System; University of California Berkeley
Abstract: Reinforcement Learning holds great promise for data-driven decision-making in various social contexts, including healthcare, education, and business. However, classical methods that focus on the mean of the total return may yield misleading results when dealing with heterogeneous populations typically found in large-scale datasets. To address this issue, we introduce the K-Value Heterogeneous Markov Decision Process, a framework designed to handle sequential decision problems with latent popul...
-
Authors: Li, Jingyi Jessica; Zhou, Heather J.; Bickel, Peter J.; Tong, Xin
Affiliations: University of California System; University of California Los Angeles; University of California Berkeley; University of Southern California
Abstract: Motivated by the pressing needs for dissecting heterogeneous relationships in gene expression data, here we generalize the squared Pearson correlation to capture a mixture of linear dependences between two real-valued variables, with or without an index variable that specifies the line memberships. We construct the generalized Pearson correlation squares by focusing on three aspects: variable exchangeability, no parametric model assumptions, and inference of population-level parameters. To com...