Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

Document type:
Article
Authors:
Li, Gen; Cai, Changxiao; Chen, Yuxin; Wei, Yuting; Chi, Yuejie
Affiliations:
University of Pennsylvania; University of Pennsylvania; Carnegie Mellon University
Journal:
OPERATIONS RESEARCH
ISSN/ISBN:
0030-364X
DOI:
10.1287/opre.2023.2450
Publication date:
2024
Keywords:
Q-learning; temporal difference learning; effective horizon; sample complexity; minimax optimality; lower bound; overestimation
Abstract:
Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous setting (in which independent samples for all state-action pairs are drawn from a generative model in each iteration), substantial progress has been made toward understanding the sample efficiency of Q-learning. Consider a gamma-discounted infinite-horizon MDP with state space S and action space A: to yield an entry-wise epsilon-approximation of the optimal Q-function, state-of-the-art theory for Q-learning requires a sample size exceeding the order of |S||A| / ((1-gamma)^5 epsilon^2), which fails to match existing minimax lower bounds. This gives rise to natural questions: What is the sharp sample complexity of Q-learning? Is Q-learning provably suboptimal? This paper addresses these questions for the synchronous setting: (1) when the action space contains a single action (so that Q-learning reduces to TD learning), we prove that the sample complexity of TD learning is minimax optimal and scales as |S| / ((1-gamma)^3 epsilon^2) (up to log factor); (2) when the action space contains at least two actions, we settle the sample complexity of Q-learning to be on the order of |S||A| / ((1-gamma)^4 epsilon^2) (up to log factor). Our theory unveils the strict suboptimality of Q-learning when the action space contains at least two actions and rigorizes the negative impact of overestimation in Q-learning. Finally, we extend our analysis to accommodate asynchronous Q-learning (i.e., the case with Markovian samples), sharpening the horizon dependency of its sample complexity to 1/(1-gamma)^4.
Source URL:
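
Illustrative sketch: the abstract above concerns synchronous Q-learning with a generative model. The sketch below is a minimal, generic rendering of that setting, not the paper's exact algorithm; the function name synchronous_q_learning, the tabular inputs P and r, and the constant step size eta are assumptions made for illustration (the paper's results hinge on a particular learning-rate schedule that this sketch does not reproduce).

```python
import numpy as np

def synchronous_q_learning(P, r, gamma, num_iters, eta=0.1, seed=None):
    """Generic synchronous Q-learning with a generative model (illustrative).

    P: transition kernel, array of shape (S, A, S); each P[s, a] sums to 1.
    r: reward table, array of shape (S, A), with rewards in [0, 1].
    eta: constant step size (illustrative only; not the schedule analyzed
         in the paper).
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        # Synchronous setting: draw one independent next state s' ~ P(.|s, a)
        # for EVERY state-action pair in each iteration.
        next_states = np.array([[rng.choice(S, p=P[s, a]) for a in range(A)]
                                for s in range(S)])
        # Q-learning update toward the empirical Bellman target
        # r(s, a) + gamma * max_{a'} Q(s', a').
        target = r + gamma * Q[next_states].max(axis=-1)
        Q = (1 - eta) * Q + eta * target
    return Q
```

When the action space contains a single action (A = 1), the max over next-state actions is trivial and the update reduces to TD learning, matching case (1) described in the abstract.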