Sublinear regret for learning POMDPs

Publication type:
Article
Authors:
Xiong, Yi; Chen, Ningyuan; Gao, Xuefeng; Zhou, Xiang
Affiliations:
Chinese University of Hong Kong; University of Toronto
Journal:
PRODUCTION AND OPERATIONS MANAGEMENT
ISSN/ISBN:
1059-1478
DOI:
10.1111/poms.13778
Publication year:
2022
Pages:
3491-3504
Keywords:
exploration-exploitation; online learning; partially observable MDP; spectral estimator
Abstract:
We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment, in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem that builds on spectral method-of-moments estimation for hidden Markov models, belief error control in POMDPs, and upper confidence bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed algorithm, where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to this oracle for learning general POMDPs.
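To make the episodic explore-exploit structure described in the abstract concrete, here is a minimal Python sketch of that skeleton on a hypothetical toy POMDP. Everything in it is an illustrative assumption: the 2-state/2-action/2-observation model is made up, and a plain UCB index over actions stands in for the paper's spectral method-of-moments estimation and optimistic planning over belief states, which are substantially more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy POMDP, used only to make the sketch runnable (not from the paper).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # T[a, s, s']: transition kernel per action
              [[0.5, 0.5], [0.4, 0.6]]])
O = np.array([[0.8, 0.2],                 # O[s, o]: observation kernel
              [0.3, 0.7]])
R = np.array([[1.0, 0.0],                 # R[s, a]: mean reward of action a in state s
              [0.0, 1.0]])

horizon = 20_000
n_actions = 2
counts = np.ones(n_actions)       # plays per action (start at 1 to avoid division by zero)
reward_sums = np.zeros(n_actions)
state, t, episode = 0, 0, 0

while t < horizon:
    episode += 1
    # Episode lengths grow over time; the paper's analysis balances estimation
    # error against exploration cost, which is where a T^{2/3}-type rate arises.
    ep_len = min(100 * episode, horizon - t)

    # In the actual algorithm, this is where spectral method-of-moments estimation
    # of the hidden-Markov-model parameters and optimistic planning over the
    # resulting confidence region would occur. As a crude stand-in, commit to the
    # action with the highest UCB index on average reward for the whole episode.
    ucb = reward_sums / counts + np.sqrt(2 * np.log(max(t, 2)) / counts)
    act = int(np.argmax(ucb))

    for _ in range(ep_len):
        obs = rng.choice(2, p=O[state])   # observation (ignored by this crude policy)
        r = R[state, act] + 0.1 * rng.standard_normal()
        reward_sums[act] += r
        counts[act] += 1
        state = rng.choice(2, p=T[act, state])  # hidden state evolves unobserved
        t += 1

print(f"episodes={episode}, average reward={reward_sums.sum() / horizon:.3f}")
```

The commit-per-episode design choice mirrors the general pattern of such algorithms: re-estimating and re-planning only at episode boundaries keeps the number of policy switches small, which is what the regret analysis controls.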