Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds
Publication type:
Article
Authors:
Agrawal, Shipra; Jia, Randy
Affiliation:
Columbia University
Journal:
MATHEMATICS OF OPERATIONS RESEARCH
ISSN:
0364-765X
DOI:
10.1287/moor.2022.1266
Publication date:
2023
Pages:
363-392
Keywords:
Abstract:
We present an algorithm based on posterior sampling (a.k.a. Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov decision process (MDP) is communicating with a finite, although unknown, diameter. Our main result is a high-probability regret upper bound of Õ(DS√(AT)) for any communicating MDP with S states, A actions, and diameter D. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy over time horizon T. This result closely matches the known lower bound of Ω(√(DSAT)). Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
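The core loop of posterior sampling for MDPs can be illustrated with a short sketch: maintain Dirichlet posteriors over the transition probabilities, sample one MDP from the posterior, and act greedily with respect to a policy computed for that sample. The sketch below is only a minimal illustration of this idea; the function names, the use of finite-horizon value iteration (in place of the paper's average-reward optimization), and the known-rewards assumption are simplifications not taken from the paper.

```python
import random

def sample_dirichlet(alpha):
    # Draw from Dirichlet(alpha) via normalized Gamma samples.
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def posterior_sampling_step(counts, rewards, S, A, horizon=50):
    # counts[s][a] holds Dirichlet parameters over next states
    # (e.g., a prior of 1 plus observed transition counts).
    # Sample one transition model from the posterior...
    P = [[sample_dirichlet(counts[s][a]) for a in range(A)] for s in range(S)]
    # ...then compute a greedy policy for the sampled MDP by
    # finite-horizon value iteration (an illustrative stand-in for the
    # infinite-horizon average-reward planning used in the paper).
    V = [0.0] * S
    for _ in range(horizon):
        Q = [[rewards[s][a] + sum(P[s][a][t] * V[t] for t in range(S))
              for a in range(A)] for s in range(S)]
        V = [max(Q[s]) for s in range(S)]
    return [max(range(A), key=lambda a: Q[s][a]) for s in range(S)]

# Usage: a toy 2-state, 2-action MDP with a uniform Dirichlet prior.
random.seed(0)
counts = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
rewards = [[0.0, 1.0], [0.5, 0.0]]
policy = posterior_sampling_step(counts, rewards, S=2, A=2)
print(policy)
```

In a full algorithm, the agent would execute the sampled policy for an episode, update the Dirichlet counts with the observed transitions, and resample; the regret analysis in the paper controls how quickly these sampled models concentrate around the true MDP.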