Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds
Publication type:
Article
Authors:
Agrawal, Shipra; Jia, Randy
Affiliation:
Columbia University
Journal:
MATHEMATICS OF OPERATIONS RESEARCH
ISSN:
0364-765X
DOI:
10.1287/moor.2022.1266
Publication date:
2023
Pages:
363-392
Keywords:
Abstract:
We present an algorithm based on posterior sampling (a.k.a. Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov decision process (MDP) is communicating with a finite, although unknown, diameter. Our main result is a high-probability regret upper bound of Õ(DS√(AT)) for any communicating MDP with S states, A actions, and diameter D. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy over time horizon T. This result closely matches the known lower bound of Ω(√(DSAT)). Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
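The core loop of posterior sampling for MDPs can be illustrated with a short sketch: maintain Dirichlet posteriors over the transition probabilities, sample one MDP from the posterior, and act greedily with respect to a policy computed for that sample. The sketch below is only a minimal illustration of this idea; the function names, the use of finite-horizon value iteration (in place of the paper's average-reward optimization), and the known-rewards assumption are simplifications not taken from the paper.

```python
import random

def sample_dirichlet(alpha):
    # Draw from Dirichlet(alpha) via normalized Gamma samples.
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def posterior_sampling_step(counts, rewards, S, A, horizon=50):
    # counts[s][a] holds Dirichlet parameters over next states
    # (e.g., a prior of 1 plus observed transition counts).
    # Sample one transition model from the posterior...
    P = [[sample_dirichlet(counts[s][a]) for a in range(A)] for s in range(S)]
    # ...then compute a greedy policy for the sampled MDP by
    # finite-horizon value iteration (an illustrative stand-in for the
    # infinite-horizon average-reward planning used in the paper).
    V = [0.0] * S
    for _ in range(horizon):
        Q = [[rewards[s][a] + sum(P[s][a][t] * V[t] for t in range(S))
              for a in range(A)] for s in range(S)]
        V = [max(Q[s]) for s in range(S)]
    return [max(range(A), key=lambda a: Q[s][a]) for s in range(S)]

# Usage: a toy 2-state, 2-action MDP with a uniform Dirichlet prior.
random.seed(0)
counts = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
rewards = [[0.0, 1.0], [0.5, 0.0]]
policy = posterior_sampling_step(counts, rewards, S=2, A=2)
print(policy)
```

In a full algorithm, the agent would execute the sampled policy for an episode, update the Dirichlet counts with the observed transitions, and resample; the regret analysis in the paper controls how quickly these sampled models concentrate around the true MDP.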