-
Authors: Jin, Jiashun; Ke, Zheng Tracy; Luo, Shengming; Wang, Minzhe
Affiliations: Carnegie Mellon University; Harvard University
Abstract: In network analysis, how to estimate the number of communities K is a fundamental problem. We consider a broad setting where we allow severe degree heterogeneity and a wide range of sparsity levels, and propose Stepwise Goodness of Fit (StGoF) as a new approach. This is a stepwise algorithm, where for m = 1, 2, ..., we alternately use a community detection step and a goodness of fit (GoF) step. We adapt SCORE (Jin) for community detection, and propose a new GoF metric. We show that at step m, th...
-
Authors: Scealy, Janice L.; Wood, Andrew T. A.
Affiliations: Australian National University
Abstract: Compositional data are challenging to analyse due to the non-negativity and sum-to-one constraints on the sample space. With real data, it is often the case that many of the compositional components are highly right-skewed, with large numbers of zeros. Major limitations of currently available models for compositional data include one or more of the following: insufficient flexibility in terms of distributional shape; difficulty in accommodating zeros in the data in estimation; and lack of comp...
-
Authors: Vazquez-Bare, Gonzalo
Affiliations: University of California System; University of California Santa Barbara
Abstract: I set up a potential outcomes framework to analyze spillover effects using instrumental variables. I characterize the population compliance types in a setting in which spillovers can occur on both treatment take-up and outcomes, and provide conditions for identification of the marginal distribution of compliance types. I show that intention-to-treat (ITT) parameters aggregate multiple direct and spillover effects for different compliance types, and hence do not have a clear link to causally in...
-
Authors: Zhu, Yichen; Li, Cheng; Dunson, David B.
Affiliations: Duke University; National University of Singapore; Duke University
Abstract: Classification algorithms face difficulties when one or more classes have limited training data. We are particularly interested in classification trees, due to their interpretability and flexibility. When data are limited in one or more of the classes, the estimated decision boundaries are often irregularly shaped due to the limited sample size, leading to poor generalization error. We propose a novel approach that penalizes the Surface-to-Volume Ratio (SVR) of the decision set, obtaining a ne...
-
Authors: Xia, Yin; Cai, T. Tony
Affiliations: Fudan University; University of Pennsylvania
-
Authors: Xu, Ganggang; Liang, Chen; Waagepetersen, Rasmus; Guan, Yongtao
Affiliations: University of Miami; State University of New York (SUNY) System; Binghamton University, SUNY; Amazon.com; Aalborg University
Abstract: Specification of a parametric model for the intensity function is a fundamental task in statistics for spatial point processes. It is, therefore, crucial to be able to assess the appropriateness of a suggested model for a given point pattern dataset. For this purpose, we develop a new class of semiparametric goodness-of-fit tests for the specified parametric first-order intensity, without assuming a full data generating mechanism that is needed for the existing popular Monte Carlo tests. The p...
-
Authors: Dai, Chenguang; Lin, Buyu; Xing, Xin; Liu, Jun S.
Affiliations: Harvard University; Virginia Polytechnic Institute & State University
Abstract: The Generalized Linear Model (GLM) has been widely used in practice to model counts or other types of non-Gaussian data. This article introduces a framework for feature selection in the GLM that can achieve robust False Discovery Rate (FDR) control. The main idea is to construct a mirror statistic based on data perturbation to measure the importance of each feature. FDR control is achieved by taking advantage of the mirror statistic's property that its sampling distribution is (asymptotically)...
-
Authors: Li, Lexin; Zeng, Jing; Zhang, Xin
Affiliations: University of California System; University of California Berkeley; State University System of Florida; Florida State University; Chinese Academy of Sciences; University of Science & Technology of China, CAS
Abstract: Multimodal data are now prevalent in scientific research. One of the central questions in multimodal integrative analysis is to understand how two data modalities associate and interact with each other given another modality or demographic variables. The problem can be formulated as studying the associations among three sets of random variables, a question that has received relatively little attention in the literature. In this article, we propose a novel generalized liquid association analysis...
-
Authors: Ma, Haiqiang; Jiang, Jiming
Affiliations: Jiangxi University of Finance & Economics; University of California System; University of California Davis
Abstract: We propose a new classified mixed model prediction (CMMP) procedure, called pseudo-Bayesian CMMP, that uses network information in matching the group index between the training data and new data, whose characteristics of interest one wishes to predict. The current CMMP procedures do not incorporate such information; as a result, the methods are not consistent in terms of matching the group index. Although, as the number of training data groups increases, the current CMMP method can predict the...
-
Authors: Guo, Xinzhou; Wei, Waverly; Liu, Molei; Cai, Tianxi; Wu, Chong; Wang, Jingshen
Affiliations: Hong Kong University of Science & Technology; University of California System; University of California Berkeley; Harvard University; Harvard T.H. Chan School of Public Health; University of Texas System; UTMD Anderson Cancer Center
Abstract: There have been increased concerns that the use of statins, one of the most commonly prescribed drugs for treating coronary artery disease, is potentially associated with an increased risk of new-onset type II diabetes (T2D). Nevertheless, to date, there is no robust evidence as to whether, and in which populations, patients are indeed vulnerable to developing T2D after taking statins. In this case study, leveraging the biobank and electronic health record data in the Partner Health Syst...