Neural Temporal Difference and Q Learning Provably Converge to Global Optima

Publication type:
Article
Authors:
Cai, Qi; Yang, Zhuoran; Lee, Jason D.; Wang, Zhaoran
Affiliations:
Northwestern University; Yale University; Princeton University
Journal:
MATHEMATICS OF OPERATIONS RESEARCH
ISSN:
0364-765X
DOI:
10.1287/moor.2023.1370
Publication date:
2024
Keywords:
algorithms
Abstract:
Temporal difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, because of the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD has remained unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparameterization of neural networks, which also plays a vital role in the empirical success of neural TD. We establish the theory for two-layer neural networks in the main paper and extend it to multilayer neural networks in the appendix. Beyond policy evaluation, we establish the global convergence of neural (soft) Q learning.
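For orientation, the sketch below illustrates the kind of algorithm the abstract refers to: semi-gradient TD(0) with an overparameterized two-layer ReLU network approximating the value function. It is a minimal illustration, not the paper's construction; the environment, reward, step size, and the choice to keep the output-layer weights fixed are all assumptions made for the example.

```python
import numpy as np

# Minimal sketch of neural TD(0) with a wide two-layer ReLU network.
# Environment details (state dimension, reward, dynamics) are placeholders.

d, m = 4, 1024          # state dimension; network width (overparameterized: m >> d)
gamma, lr = 0.9, 1e-3   # discount factor and step size (illustrative values)

rng = np.random.default_rng(0)
W = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))  # first-layer weights (trained)
b = rng.choice([-1.0, 1.0], size=m)                  # output weights (held fixed here)

def value(s, W):
    """Two-layer network: V(s) = (1/sqrt(m)) * sum_r b_r * relu(w_r^T s)."""
    return (b * np.maximum(W @ s, 0.0)).sum() / np.sqrt(m)

def td_update(s, r, s_next, W):
    """One semi-gradient TD(0) step: move V(s) toward the bootstrapped target."""
    delta = r + gamma * value(s_next, W) - value(s, W)        # TD error
    active = (W @ s > 0.0).astype(float)                      # ReLU gradient mask
    grad_V = (b * active)[:, None] * s[None, :] / np.sqrt(m)  # dV(s)/dW
    return W + lr * delta * grad_V                            # semi-gradient update

# Toy rollout with a placeholder random-walk "environment".
s = rng.normal(size=d)
for _ in range(1000):
    s_next = 0.9 * s + 0.1 * rng.normal(size=d)
    r = float(s[0])   # placeholder reward
    W = td_update(s, r, s_next, W)
    s = s_next
```

The update treats the bootstrapped target r + gamma * V(s') as fixed when differentiating, which is the semi-gradient step whose global convergence, under overparameterization, the paper analyzes.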