Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games

Publication Type:
Article
Authors:
Yan, Yuling; Li, Gen; Chen, Yuxin; Fan, Jianqing
Affiliations:
Massachusetts Institute of Technology (MIT); Chinese University of Hong Kong; University of Pennsylvania; Princeton University
Journal:
OPERATIONS RESEARCH
ISSN/ISBN:
0030-364X
DOI:
10.1287/opre.2022.0342
Publication Year:
2024
Pages:
2430-2445
Keywords:
sample complexity; performance bounds
Abstract:
This paper makes progress toward learning Nash equilibria in two-player, zero-sum Markov games from offline data. Specifically, consider a gamma-discounted, infinite-horizon Markov game with S states, in which the max-player has A actions and the min-player has B actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds, called value iteration with lower confidence bounds for zero-sum Markov games, that provably finds an epsilon-approximate Nash equilibrium with a sample complexity no larger than C*_clipped S(A+B) / ((1-gamma)^3 epsilon^2) (up to some log factor). Here, C*_clipped is a unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-a-vis the target data), and the target accuracy epsilon can be any value within (0, 1/(1-gamma)]. Our sample complexity bound improves on prior art by a factor of min{A, B}, achieving minimax optimality for a broad regime of interest. An appealing feature of our result lies in its algorithmic simplicity, which shows that neither variance reduction nor sample splitting is needed to achieve sample optimality.
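To make the algorithmic idea concrete, the following is a minimal Python sketch of one pessimistic value-iteration sweep with a Bernstein-style lower confidence bound for a zero-sum Markov game. It is an illustration under stated assumptions, not the paper's published algorithm: the names (pessimistic_update, matrix_game_value, P_hat, N, c_b) and the exact form of the penalty are hypothetical choices made here for readability.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q):
    """Max-player's maximin value of an A x B payoff matrix Q, via a linear program."""
    A, B = Q.shape
    # Variables: [v, x_1, ..., x_A]; maximize v  <=>  minimize -v.
    c = np.concatenate(([-1.0], np.zeros(A)))
    # For every min-player column b: v - sum_a x_a Q[a, b] <= 0.
    A_ub = np.hstack((np.ones((B, 1)), -Q.T))
    b_ub = np.zeros(B)
    # x lies on the probability simplex: sum_a x_a = 1, x_a >= 0.
    A_eq = np.concatenate(([0.0], np.ones(A))).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, 1.0)] * A
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[0]

def pessimistic_update(r, P_hat, N, V, gamma, c_b=1.0):
    """One value-iteration sweep using Bernstein-style lower confidence bounds.

    r:     rewards, shape (S, A, B), assumed to lie in [0, 1]
    P_hat: empirical transition kernel estimated from offline data, shape (S, A, B, S)
    N:     visit counts per (state, max-action, min-action), shape (S, A, B)
    V:     current value estimate for the max-player, shape (S,)
    c_b:   confidence-width constant (illustrative; would carry log factors in theory)
    """
    EV = P_hat @ V                                        # empirical mean of V(s'), shape (S, A, B)
    var = np.maximum(P_hat @ (V ** 2) - EV ** 2, 0.0)     # empirical variance of V(s')
    # Bernstein-style penalty: larger when counts are small or the variance is high.
    penalty = c_b * (np.sqrt(var / np.maximum(N, 1))
                     + 1.0 / ((1.0 - gamma) * np.maximum(N, 1)))
    # Pessimistic (lower-confidence-bound) Q estimate, clipped to the valid range.
    Q_lcb = np.clip(r + gamma * EV - penalty, 0.0, 1.0 / (1.0 - gamma))
    # New value at each state: maximin value of the pessimistic payoff matrix.
    S = P_hat.shape[0]
    return np.array([matrix_game_value(Q_lcb[s]) for s in range(S)])
```

In this sketch, pessimism enters only through subtracting the data-dependent penalty from the empirical Q estimate before solving each state's matrix game; no variance-reduction machinery or sample splitting is used, in line with the simplicity highlighted in the abstract.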