Is Pessimism Provably Efficient for Offline Reinforcement Learning?

Result Type:
Article; Early Access
Authors:
Jin, Ying; Yang, Zhuoran; Wang, Zhaoran
Affiliations:
Harvard University; Yale University; Northwestern University
Journal:
MATHEMATICS OF OPERATIONS RESEARCH
ISSN/ISBN:
0364-765X
DOI:
10.1287/moor.2022.0216
Publication Date:
2024
Keywords:
policy iteration; game; bounds; go
Abstract:
We study offline reinforcement learning (RL), which aims to learn an optimal policy from a data set collected a priori. Because of the lack of further interactions with the environment, offline RL suffers from insufficient coverage of the data set, which eludes most existing theoretical analysis. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as a penalty function. Such a penalty function simply flips the sign of the bonus function used to promote exploration in online RL, which makes it easily implementable and compatible with general function approximators. Without assuming sufficient coverage of the data set (e.g., finite concentrability coefficients or uniformly lower-bounded densities of visitation measures), we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, this bound matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the data set, the learned policy is the best effort among all policies, as no other policy can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which arises from irrelevant trajectories that are less covered by the data set and not informative about the optimal policy.
Source URL:
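
The abstract describes the core mechanism of PEVI: subtract an uncertainty quantifier from the Bellman backup, i.e., flip the sign of the exploration bonus used in online RL. The following is a minimal illustrative sketch of that idea in a tabular, finite-horizon setting; the count-based penalty `c / sqrt(n(s, a))` and all function names are simplifying assumptions for exposition, not the paper's exact construction (which also covers linear MDPs and general function approximation).

```python
import numpy as np

def pevi_tabular(dataset, S, A, H, c=1.0):
    """Sketch of pessimistic value iteration for a tabular finite-horizon MDP.

    dataset: list of (s, a, r, s_next) transitions collected offline.
    The penalty b(s, a) = c / sqrt(max(1, n(s, a))) stands in for the
    uncertainty quantifier; note it is SUBTRACTED from the Bellman backup,
    the sign-flipped exploration bonus described in the abstract.
    """
    counts = np.zeros((S, A))
    reward_sum = np.zeros((S, A))
    trans = np.zeros((S, A, S))
    for s, a, r, s_next in dataset:
        counts[s, a] += 1
        reward_sum[s, a] += r
        trans[s, a, s_next] += 1

    n = np.maximum(counts, 1.0)
    r_hat = reward_sum / n                  # empirical mean reward
    p_hat = trans / n[:, :, None]           # empirical transition kernel
    penalty = c / np.sqrt(n)                # uncertainty quantifier

    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r_hat + p_hat @ V - penalty     # pessimistic Bellman backup
        Q = np.clip(Q, 0.0, H - h)          # values lie in [0, H - h]
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V
```

On well-covered state-action pairs the penalty vanishes and the backup is nearly the empirical one; on poorly covered pairs the large penalty drives their Q-values down, so the greedy policy avoids them — the mechanism by which pessimism eliminates spurious correlation from under-covered trajectories.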