
ORTHOGONAL SUBSAMPLING FOR BIG DATA LINEAR REGRESSION

Document type:

Article

Authors:

Wang, Lin; Elmstedt, Jake; Wong, Weng Kee; Xu, Hongquan

Affiliations:

George Washington University; University of California System; University of California Los Angeles; University of California System; University of California Los Angeles

Journal:

ANNALS OF APPLIED STATISTICS

ISSN/ISBN:

1932-6157

DOI:

10.1214/21-AOAS1462

Publication date:

2021

Keywords:

exchange algorithms; bayesian design; regularization; construction; selection; arrays

Abstract:

The dramatic growth of big datasets presents a new challenge to data storage and analysis. Data reduction, or subsampling, that extracts useful information from datasets is a crucial step in big-data analysis. We propose an orthogonal subsampling (OSS) approach for big data with a focus on linear regression models. The approach is inspired by the fact that an orthogonal array of two levels provides the best experimental design for linear regression models in the sense that it minimizes the average variance of the estimated parameters and provides the best predictions. The merits of OSS are threefold: (i) it is easy to implement and fast; (ii) it is suitable for distributed parallel computing and ensures the subsamples selected in different batches have no common data points; and (iii) it outperforms existing methods in minimizing the mean squared errors of the estimated parameters and maximizing the efficiencies of the selected subsamples. Theoretical results and extensive numerical results show that the OSS approach is superior to existing subsampling approaches. It is also more robust to the presence of interactions among covariates, and, when they do exist, OSS provides more precise estimates of the interaction effects than existing methods. The advantages of OSS are also illustrated through analysis of real data.
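The motivating fact in the abstract, that a two-level orthogonal array minimizes the average variance of the estimated parameters, can be illustrated with a small numerical sketch. This is not the OSS algorithm from the paper. It only compares a two-level full factorial design (which is an orthogonal array) with the same number of points drawn uniformly from [-1, 1]^p, under an assumed main-effects model with intercept and unit error variance. All function and variable names are hypothetical, chosen for illustration.

```python
import itertools
import random


def information_matrix(points):
    """Return X^T X for a main-effects model with an intercept column."""
    rows = [[1.0] + list(p) for p in points]
    k = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]


def trace_of_inverse(matrix):
    """Trace of the inverse, via Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    # Augment the matrix with the identity; the right half becomes the inverse.
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("information matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return sum(aug[i][k + i] for i in range(k))


p = 3  # number of covariates (assumed for the illustration)

# Two-level full factorial in p factors: an orthogonal array with 2^p runs.
orthogonal = list(itertools.product([-1.0, 1.0], repeat=p))
n = len(orthogonal)

# Same number of points, drawn uniformly from the cube [-1, 1]^p.
rng = random.Random(0)
random_design = [tuple(rng.uniform(-1.0, 1.0) for _ in range(p)) for _ in range(n)]

# Average variance of the p + 1 least-squares estimates, in units of sigma^2.
avg_var_orthogonal = trace_of_inverse(information_matrix(orthogonal)) / (p + 1)
avg_var_random = trace_of_inverse(information_matrix(random_design)) / (p + 1)

print("orthogonal design:", avg_var_orthogonal)
print("random design:    ", avg_var_random)
```

For the orthogonal design the information matrix is n times the identity, so the average variance equals 1/n (0.125 here), which is the lower bound for any n points in [-1, 1]^p. The random design gives a larger value.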
