Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

Document type:
Article
Authors:
Li, Gen; Cai, Changxiao; Chen, Yuxin; Wei, Yuting; Chi, Yuejie
Affiliations:
University of Pennsylvania; University of Pennsylvania; Carnegie Mellon University
Journal:
OPERATIONS RESEARCH
ISSN/ISBN:
0030-364X
DOI:
10.1287/opre.2023.2450
Publication date:
2024
Keywords:
Q-learning; temporal difference learning; effective horizon; sample complexity; minimax optimality; lower bound; overestimation
Abstract:
Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous setting (in which independent samples for all state-action pairs are drawn from a generative model in each iteration), substantial progress has been made toward understanding the sample efficiency of Q-learning. Consider a gamma-discounted infinite-horizon MDP with state space S and action space A: to yield an entry-wise epsilon-approximation of the optimal Q-function, state-of-the-art theory for Q-learning requires a sample size exceeding the order of |S||A| / ((1-gamma)^5 epsilon^2), which fails to match existing minimax lower bounds. This gives rise to natural questions: What is the sharp sample complexity of Q-learning? Is Q-learning provably suboptimal? This paper addresses these questions for the synchronous setting: (1) when the action space contains a single action (so that Q-learning reduces to TD learning), we prove that the sample complexity of TD learning is minimax optimal and scales as |S| / ((1-gamma)^3 epsilon^2) (up to log factor); (2) when the action space contains at least two actions, we settle the sample complexity of Q-learning to be on the order of |S||A| / ((1-gamma)^4 epsilon^2) (up to log factor). Our theory unveils the strict suboptimality of Q-learning when the action space contains at least two actions and rigorizes the negative impact of overestimation in Q-learning. Finally, we extend our analysis to accommodate asynchronous Q-learning (i.e., the case with Markovian samples), sharpening the horizon dependency of its sample complexity to 1/(1-gamma)^4.
Source URL:
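
Illustrative sketch: the abstract above concerns synchronous Q-learning with a generative model. The sketch below is a minimal, generic rendering of that setting, not the paper's exact algorithm; the function name synchronous_q_learning, the tabular inputs P and r, and the constant step size eta are assumptions made for illustration (the paper's results hinge on a particular learning-rate schedule that this sketch does not reproduce).

```python
import numpy as np

def synchronous_q_learning(P, r, gamma, num_iters, eta=0.1, seed=None):
    """Generic synchronous Q-learning with a generative model (illustrative).

    P: transition kernel, array of shape (S, A, S); each P[s, a] sums to 1.
    r: reward table, array of shape (S, A), with rewards in [0, 1].
    eta: constant step size (illustrative only; not the schedule analyzed
         in the paper).
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        # Synchronous setting: draw one independent next state s' ~ P(.|s, a)
        # for EVERY state-action pair in each iteration.
        next_states = np.array([[rng.choice(S, p=P[s, a]) for a in range(A)]
                                for s in range(S)])
        # Q-learning update toward the empirical Bellman target
        # r(s, a) + gamma * max_{a'} Q(s', a').
        target = r + gamma * Q[next_states].max(axis=-1)
        Q = (1 - eta) * Q + eta * target
    return Q
```

When the action space contains a single action (A = 1), the max over next-state actions is trivial and the update reduces to TD learning, matching case (1) described in the abstract.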