-
Authors: Meng, Kun; Wang, Jinyu; Crawford, Lorin; Eloyan, Ani
Affiliations: Brown University; Brown University; Brown University; Microsoft
Abstract: In this article, we establish the mathematical foundations for modeling the randomness of shapes and conducting statistical inference on shapes using the smooth Euler characteristic transform. Based on these foundations, we propose two Chi-squared statistic-based algorithms for testing hypotheses on random shapes. Simulation studies are presented to validate our mathematical derivations and to compare our algorithms with state-of-the-art methods to demonstrate the utility of our proposed frame...
-
Authors: Gao, Youqian; Dai, Ben
Affiliations: Chinese University of Hong Kong
Abstract: The technique of word embedding is widely used in natural language processing (NLP) to represent words as numerical vectors in textual datasets. However, the estimation of word embedding may suffer from severe overfitting due to the huge variety of words. To address the issue, this article proposes a novel regularization framework that recognizes and accounts for the word-level distribution discrepancy, a common phenomenon in a range of NLP tasks where word distributions are noticeably disparat...
-
Authors: Morey, Richard D.; Davis-Stober, Clintin P.
Affiliations: Cardiff University; University of Missouri System; University of Missouri Columbia; University of Missouri System; University of Missouri Columbia
Abstract: The P-curve is a widely used suite of meta-analytic tests advertised for detecting problems in sets of studies. They are based on nonparametric combinations of p values (e.g., Marden) across significant (p < .05) studies and are variously claimed to detect evidential value, lack of evidential value, and left skew in p values. We show that these tests do not have the properties ascribed to them. Moreover, they fail basic desiderata for tests, including admissibility and monotonicity. In light o...
-
Authors: Watson, Samuel I.; Smith, Thomas A.
Affiliations: University of Birmingham
Abstract: In this article, we consider randomized trial design to evaluate interventions with spatially or spatio-temporally heterogeneous effects. A common approach in this setting is the cluster randomized trial. In many cases, clusters are constituted as discrete subdivisions of a contiguous area of interest. However, cluster trials designed in this way may suffer from issues of spillover and may fail to capture the relevant spatial and temporal effects. We define possible randomization schemes and c...
-
Authors: Cai, Yun; Gu, Hong; Kenney, Toby
Affiliations: Dalhousie University
Abstract: Deconvolution is the important problem of estimating the distribution of a quantity of interest from a sample with additive measurement error. Nearly all infinite-dimensional deconvolution methods in the literature use Fourier transformations. These methods are mathematically neat but unstable, and they produce poor estimates when the signal-to-noise ratio or the sample size is low. A popular alternative is to maximize penalized likelihood for a finite-dimensional basis expansion of the unknown density. We d...
-
Authors: Joseph, V. Roshan
Affiliations: University System of Georgia; Georgia Institute of Technology
-
Authors: Kook, Lucas; Saengkyongam, Sorawit; Lundborg, Anton Rask; Hothorn, Torsten; Peters, Jonas
Affiliations: Swiss Federal Institutes of Technology Domain; ETH Zurich; University of Copenhagen; Swiss School of Public Health (SSPH+); University of Zurich
Abstract: Discovering causal relationships from observational data is a fundamental yet challenging task. Invariant causal prediction (ICP, Peters, Bühlmann, and Meinshausen) is a method for causal feature selection which requires data from heterogeneous settings and exploits the invariance of causal models. ICP has been extended to general additive noise models and to nonparametric settings using conditional independence tests. However, the latter often suffer from low power (or poor Type I err...
-
Authors: Leiner, James; Duan, Boyan; Wasserman, Larry; Ramdas, Aaditya
Affiliations: Carnegie Mellon University; Alphabet Inc.; Google Incorporated
Abstract: Suppose we observe a random vector X from some distribution in a known family with unknown parameters. We ask the following question: when is it possible to split X into two pieces f(X) and g(X) such that neither part is sufficient to reconstruct X by itself, but both together can recover X fully, and their joint distribution is tractable? One common solution to this problem when multiple samples of X are observed is data splitting, but Rasines and Young offer an alternative approach that use...
-
Authors: Lemyre, Felix Camirand; Carroll, Raymond J.; Delaigle, Aurore
Affiliations: University of Sherbrooke; Texas A&M University System; Texas A&M University College Station; University of Melbourne
Abstract: We consider nonparametric estimation of the density of the long-term trend of a semicontinuous variable observed repeatedly over time. These variables arise when measuring the intensity of an intermittent phenomenon, such as the intake of an episodically consumed nutrient or the concentration of an intermittent toxic substance: when the phenomenon is absent, the measurement is equal to zero; otherwise, it is positive. Semicontinuous data are usually represented by a two-part model describing t...
-
Authors: Wang, Zhijing; Xu, Peirong; Zhao, Hongyu; Wang, Tao
Affiliations: Shanghai Jiao Tong University; Yale University; Shanghai Jiao Tong University; Shanghai Jiao Tong University
Abstract: The Poisson factor model is a powerful tool for dimension reduction and visualization of large-scale count datasets across diverse domains. Despite the availability of efficient algorithms for estimating factors and loadings, existing methods either require prior knowledge of the number of factors, or resort to ad hoc criteria for its determination. This article proposes a novel data-driven criterion called Information Criterion via Data Thinning (ICDT), leveraging the thinning property of the...