DEEP APPROXIMATE POLICY ITERATION

Publication Type:
Article
Authors:
Jiao, Yuling; Kang, Lican; Liu, Jin; Liu, Xiliang; Yang, Jerry Zhijian
Affiliations:
Wuhan University; Wuhan University; The Chinese University of Hong Kong, Shenzhen; Wuhan University; Wuhan University; Wuhan University
Journal:
ANNALS OF STATISTICS
ISSN/ISBN:
0090-5364
DOI:
10.1214/24-AOS2486
Publication Date:
2025
Pages:
802-821
Keywords:
convolutional neural-networks; Empirical Processes; bounds; error; game
Abstract:
In this paper, we consider deep approximate policy iteration (DAPI) with Bellman residual minimization in reinforcement learning. In each iteration of DAPI, we apply convolutional neural networks (CNNs) with ReLU activation, called ReLU CNNs, to estimate the fixed point of the Bellman equation by minimizing an unbiased minimax loss. To bound the estimation error in each iteration, we control the statistical and approximation errors using tools from empirical process theory with dependent data and from deep approximation theory, respectively. We establish a novel statistical error bound for ReLU CNNs on dependent data satisfying a C-mixing condition, and an approximation error bound for ReLU CNNs on the Hölder class. Combining these bounds with an error propagation analysis, we obtain a nonasymptotic error bound between the optimal action-value function Q* and the estimated Q function induced by the greedy policy in DAPI. This bound depends on the sample size and the ambient dimension of the data, as well as on the size, weight bound, and depth of the CNNs, providing prior guidance on how to set these hyperparameters to achieve the desired convergence rate when training DAPI in practice. Moreover, the bound circumvents the curse of dimensionality if the distribution of state-action pairs is supported on a set with low intrinsic dimension.
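As a rough illustration of the kind of objective the abstract refers to, the following is a standard saddle-point reformulation of Bellman residual minimization; the paper's exact loss may differ. Here Q is the action-value function being fitted, h is an auxiliary dual (test) function, γ is the discount factor, and (s, a, r, s') is a sampled transition; all of these symbols are introduced here for illustration only and are not taken from the paper.

\[
(\mathcal{T}Q)(s,a) = \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q(s',a') \,\middle|\, s,a \right],
\qquad
\mathcal{L}(Q) = \mathbb{E}\!\left[ \big( (\mathcal{T}Q)(s,a) - Q(s,a) \big)^2 \right].
\]
\[
\mathcal{L}(Q) = \max_{h} \; \mathbb{E}_{(s,a,r,s')}\!\left[\, 2\,h(s,a)\big( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big) - h(s,a)^2 \,\right].
\]

Plugging a single sampled transition directly into the squared residual in the first display is biased (the double-sampling problem), whereas the inner expectation in the second display can be estimated without bias from sampled transitions for fixed (Q, h); the pointwise maximum over h(s,a) recovers the squared conditional Bellman residual, which motivates minimax losses of this type.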