Is Pessimism Provably Efficient for Offline Reinforcement Learning?

Result Type:
Article; Early Access
Authors:
Jin, Ying; Yang, Zhuoran; Wang, Zhaoran
Affiliations:
Harvard University; Yale University; Northwestern University
Journal:
MATHEMATICS OF OPERATIONS RESEARCH
ISSN/ISBN:
0364-765X
DOI:
10.1287/moor.2022.0216
Publication Date:
2024
Keywords:
policy iteration; game; bounds; go
Abstract:
We study offline reinforcement learning (RL), which aims to learn an optimal policy from a data set collected a priori. Because of the lack of further interactions with the environment, offline RL suffers from insufficient coverage of the data set, which eludes most existing theoretical analysis. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as a penalty function. Such a penalty function simply flips the sign of the bonus function used to promote exploration in online RL, which makes it easily implementable and compatible with general function approximators. Without assuming sufficient coverage of the data set (e.g., finite concentrability coefficients or uniformly lower-bounded densities of visitation measures), we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, this bound matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the data set, the learned policy is the best effort among all policies, as no other policy can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which arises from irrelevant trajectories that are less covered by the data set and not informative about the optimal policy.
Source URL:
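
The abstract describes the core mechanism of PEVI: subtract an uncertainty quantifier from the Bellman backup, i.e., flip the sign of the exploration bonus used in online RL. The following is a minimal illustrative sketch of that idea in a tabular, finite-horizon setting; the count-based penalty `c / sqrt(n(s, a))` and all function names are simplifying assumptions for exposition, not the paper's exact construction (which also covers linear MDPs and general function approximation).

```python
import numpy as np

def pevi_tabular(dataset, S, A, H, c=1.0):
    """Sketch of pessimistic value iteration for a tabular finite-horizon MDP.

    dataset: list of (s, a, r, s_next) transitions collected offline.
    The penalty b(s, a) = c / sqrt(max(1, n(s, a))) stands in for the
    uncertainty quantifier; note it is SUBTRACTED from the Bellman backup,
    the sign-flipped exploration bonus described in the abstract.
    """
    counts = np.zeros((S, A))
    reward_sum = np.zeros((S, A))
    trans = np.zeros((S, A, S))
    for s, a, r, s_next in dataset:
        counts[s, a] += 1
        reward_sum[s, a] += r
        trans[s, a, s_next] += 1

    n = np.maximum(counts, 1.0)
    r_hat = reward_sum / n                  # empirical mean reward
    p_hat = trans / n[:, :, None]           # empirical transition kernel
    penalty = c / np.sqrt(n)                # uncertainty quantifier

    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r_hat + p_hat @ V - penalty     # pessimistic Bellman backup
        Q = np.clip(Q, 0.0, H - h)          # values lie in [0, H - h]
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V
```

On well-covered state-action pairs the penalty vanishes and the backup is nearly the empirical one; on poorly covered pairs the large penalty drives their Q-values down, so the greedy policy avoids them — the mechanism by which pessimism eliminates spurious correlation from under-covered trajectories.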