DEEP APPROXIMATE POLICY ITERATION

Publication Type:
Article
Authors:
Jiao, Yuling; Kang, Lican; Liu, Jin; Liu, Xiliang; Yang, Jerry Zhijian
Affiliations:
Wuhan University; Wuhan University; The Chinese University of Hong Kong, Shenzhen; Wuhan University; Wuhan University; Wuhan University
Journal:
ANNALS OF STATISTICS
ISSN/ISBN:
0090-5364
DOI:
10.1214/24-AOS2486
Publication Date:
2025
Pages:
802-821
Keywords:
convolutional neural-networks; Empirical Processes; bounds; error; game
Abstract:
In this paper, we consider deep approximate policy iteration (DAPI) with Bellman residual minimization in reinforcement learning. In each iteration of DAPI, we apply convolutional neural networks (CNNs) with ReLU activation, called ReLU CNNs, to estimate the fixed point of the Bellman equation by minimizing an unbiased minimax loss. To bound the estimation error in each iteration, we control the statistical and approximation errors using tools from empirical process theory with dependent data and from deep approximation theory, respectively. We establish a novel statistical error bound for ReLU CNNs on dependent data satisfying a C-mixing condition, and an approximation error bound for ReLU CNNs on the Hölder class. Combining these bounds with an error propagation analysis, we obtain a nonasymptotic error bound between the optimal action-value function Q* and the estimated Q function induced by the greedy policy in DAPI. This bound depends on the sample size and the ambient dimension of the data, as well as on the size, weight bound, and depth of the CNNs, providing prior guidance on how to set these hyperparameters to achieve the desired convergence rate when training DAPI in practice. Moreover, the bound circumvents the curse of dimensionality if the distribution of state-action pairs is supported on a set with low intrinsic dimension.
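As a rough illustration of the kind of objective the abstract refers to, the following is a standard saddle-point reformulation of Bellman residual minimization; the paper's exact loss may differ. Here Q is the action-value function being fitted, h is an auxiliary dual (test) function, γ is the discount factor, and (s, a, r, s') is a sampled transition; all of these symbols are introduced here for illustration only and are not taken from the paper.

\[
(\mathcal{T}Q)(s,a) = \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q(s',a') \,\middle|\, s,a \right],
\qquad
\mathcal{L}(Q) = \mathbb{E}\!\left[ \big( (\mathcal{T}Q)(s,a) - Q(s,a) \big)^2 \right].
\]
\[
\mathcal{L}(Q) = \max_{h} \; \mathbb{E}_{(s,a,r,s')}\!\left[\, 2\,h(s,a)\big( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big) - h(s,a)^2 \,\right].
\]

Plugging a single sampled transition directly into the squared residual in the first display is biased (the double-sampling problem), whereas the inner expectation in the second display can be estimated without bias from sampled transitions for fixed (Q, h); the pointwise maximum over h(s,a) recovers the squared conditional Bellman residual, which motivates minimax losses of this type.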