Regret Analysis of a Markov Policy Gradient Algorithm for Multiarm Bandits

Publication type:
Article
Authors:
Walton, Neil; Denisov, Denis
Affiliation:
Durham University
Journal:
MATHEMATICS OF OPERATIONS RESEARCH
ISSN/ISBN:
0364-765X
DOI:
10.1287/moor.2022.1311
Publication date:
2023
Pages:
1553-1588
Keywords:
Abstract:
We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm rather than using a deterministic time-decreasing learning rate. The state of the algorithm then forms a Markov chain on the probability simplex. We apply Foster–Lyapunov techniques to analyze the stability of this Markov chain. We prove that, if the learning rates are well chosen, then the policy gradient algorithm is a transient Markov chain, and the state of the chain converges to the optimal arm with logarithmic or polylogarithmic regret.
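The setup described in the abstract can be illustrated with a minimal sketch: a policy kept directly on the probability simplex and updated by a stochastic-gradient step whose learning rate depends on the current state rather than on elapsed time. The specific state-dependent rate below (shrinking as the policy concentrates) is an illustrative assumption, not the schedule analyzed in the paper, and the update used is a standard reward-inaction-style policy-gradient step.

```python
import random

def policy_gradient_bandit(mu, steps=20000, c=0.1, seed=0):
    """Sketch of a policy gradient algorithm for a Bernoulli bandit.

    The policy p lives on the probability simplex, so the sequence of
    states forms a Markov chain on the simplex. The learning rate here
    is state-dependent (it shrinks as the policy concentrates); this
    particular choice is illustrative, not the paper's schedule.
    """
    rng = random.Random(seed)
    k = len(mu)
    p = [1.0 / k] * k  # start at the barycentre of the simplex
    for _ in range(steps):
        # Sample an arm from the current policy.
        u, a, acc = rng.random(), 0, 0.0
        for i, pi in enumerate(p):
            acc += pi
            if u <= acc:
                a = i
                break
        r = 1.0 if rng.random() < mu[a] else 0.0  # Bernoulli reward
        gamma = c * (1.0 - max(p))  # state-dependent learning rate (assumed form)
        # Stochastic-gradient step: the increments sum to zero, so p
        # stays on the simplex, and gamma * r < 1 preserves positivity.
        p = [pi + gamma * r * ((1.0 if i == a else 0.0) - pi)
             for i, pi in enumerate(p)]
    return p
```

Running this on a three-arm instance typically drives the policy toward the arm with the highest mean, consistent with the convergence behavior the paper establishes for well-chosen learning rates.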