SETTLING THE SAMPLE COMPLEXITY OF MODEL-BASED OFFLINE REINFORCEMENT LEARNING
Publication Type:
Article
Authors:
Li, Gen; Shi, Laixi; Chen, Yuxin; Chi, Yuejie; Wei, Yuting
Affiliations:
Chinese University of Hong Kong; California Institute of Technology; University of Pennsylvania; Carnegie Mellon University
Journal:
ANNALS OF STATISTICS
ISSN/ISBN:
0090-5364
DOI:
10.1214/23-AOS2342
Publication Date:
2024
Pages:
233-260
Keywords:
Bounds
Abstract:
This paper is concerned with offline reinforcement learning (RL), which learns using precollected data without further exploration. Effective offline RL should be able to accommodate distribution shift and limited data coverage. However, prior results either suffer from suboptimal sample complexities or incur a high burn-in cost to reach sample optimality, posing an impediment to efficient offline RL in sample-starved applications. We demonstrate that the model-based (or plug-in) approach achieves minimax-optimal sample complexity without any burn-in cost for tabular Markov decision processes (MDPs). Concretely, consider a gamma-discounted infinite-horizon (resp. finite-horizon) MDP with S states and effective horizon 1/(1 - gamma) (resp. horizon H), and suppose the distribution shift of the data is reflected by some single-policy clipped concentrability coefficient C*_clipped. We prove that model-based offline RL yields epsilon-accuracy with a sample complexity of S C*_clipped / ((1 - gamma)^3 epsilon^2) for infinite-horizon MDPs and H^4 S C*_clipped / epsilon^2 for finite-horizon MDPs, up to log factors, which is minimax optimal over the entire epsilon range. The proposed algorithms are pessimistic variants of value iteration with Bernstein-style penalties, and do not require sophisticated variance reduction. Our analysis framework is established upon delicate leave-one-out decoupling arguments in conjunction with careful self-bounding techniques tailored to MDPs.
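To make the algorithmic idea concrete, below is a minimal Python sketch of pessimistic value iteration with a Bernstein-style penalty for a tabular discounted MDP, in the spirit of the approach described in the abstract. It is not the authors' exact algorithm: the function name vi_lcb, the penalty constants, the log factor, and the fixed iteration count are illustrative assumptions.

```python
# Sketch of pessimistic value iteration with a Bernstein-style lower-confidence-bound
# penalty, built on the empirical (plug-in) model from an offline dataset.
# Constants and termination criteria are illustrative, not the paper's exact choices.
import numpy as np

def vi_lcb(counts, rewards, gamma, delta=0.01, n_iters=1000):
    """counts:  (S, A, S) array of transition counts from the offline dataset
       rewards: (S, A) array of mean rewards in [0, 1]
       returns: pessimistic Q estimate and the greedy policy w.r.t. it."""
    S, A, _ = counts.shape
    N = counts.sum(axis=2)                      # visit counts N(s, a)
    N_safe = np.maximum(N, 1)                   # avoid division by zero
    P_hat = counts / N_safe[:, :, None]         # empirical transition kernel
    log_term = np.log(S * A * n_iters / delta)  # illustrative log factor

    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        EV = P_hat @ V                                    # E_{P_hat}[V], shape (S, A)
        var_V = np.maximum(P_hat @ (V ** 2) - EV ** 2, 0.0)
        # Bernstein-style penalty: variance term plus a lower-order 1/N term;
        # unseen (s, a) pairs receive the maximum possible penalty.
        b = np.sqrt(2 * var_V * log_term / N_safe) + log_term / ((1 - gamma) * N_safe)
        b = np.where(N > 0, b, 1.0 / (1.0 - gamma))
        Q = np.clip(rewards + gamma * EV - b, 0.0, 1.0 / (1.0 - gamma))
        V = Q.max(axis=1)
    policy = Q.argmax(axis=1)
    return Q, policy
```

The penalty subtracts an estimate of the statistical uncertainty of each (s, a) pair, so the returned Q is a lower confidence bound on the true value; acting greedily with respect to it implements the pessimism principle under partial data coverage.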