Sublinear regret for learning POMDPs

Publication type:
Article
Authors:
Xiong, Yi; Chen, Ningyuan; Gao, Xuefeng; Zhou, Xiang
Affiliations:
Chinese University of Hong Kong; University of Toronto
Journal:
PRODUCTION AND OPERATIONS MANAGEMENT
ISSN/ISBN:
1059-1478
DOI:
10.1111/poms.13778
Publication year:
2022
Pages:
3491-3504
Keywords:
exploration-exploitation; online learning; partially observable MDP; spectral estimator
Abstract:
We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment, in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem that builds on spectral method-of-moments estimation for hidden Markov models, belief error control in POMDPs, and upper confidence bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed algorithm, where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to this oracle for learning general POMDPs.
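To make the episodic explore-exploit structure described in the abstract concrete, here is a minimal Python sketch of that skeleton on a hypothetical toy POMDP. Everything in it is an illustrative assumption: the 2-state/2-action/2-observation model is made up, and a plain UCB index over actions stands in for the paper's spectral method-of-moments estimation and optimistic planning over belief states, which are substantially more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy POMDP, used only to make the sketch runnable (not from the paper).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # T[a, s, s']: transition kernel per action
              [[0.5, 0.5], [0.4, 0.6]]])
O = np.array([[0.8, 0.2],                 # O[s, o]: observation kernel
              [0.3, 0.7]])
R = np.array([[1.0, 0.0],                 # R[s, a]: mean reward of action a in state s
              [0.0, 1.0]])

horizon = 20_000
n_actions = 2
counts = np.ones(n_actions)       # plays per action (start at 1 to avoid division by zero)
reward_sums = np.zeros(n_actions)
state, t, episode = 0, 0, 0

while t < horizon:
    episode += 1
    # Episode lengths grow over time; the paper's analysis balances estimation
    # error against exploration cost, which is where a T^{2/3}-type rate arises.
    ep_len = min(100 * episode, horizon - t)

    # In the actual algorithm, this is where spectral method-of-moments estimation
    # of the hidden-Markov-model parameters and optimistic planning over the
    # resulting confidence region would occur. As a crude stand-in, commit to the
    # action with the highest UCB index on average reward for the whole episode.
    ucb = reward_sums / counts + np.sqrt(2 * np.log(max(t, 2)) / counts)
    act = int(np.argmax(ucb))

    for _ in range(ep_len):
        obs = rng.choice(2, p=O[state])   # observation (ignored by this crude policy)
        r = R[state, act] + 0.1 * rng.standard_normal()
        reward_sums[act] += r
        counts[act] += 1
        state = rng.choice(2, p=T[act, state])  # hidden state evolves unobserved
        t += 1

print(f"episodes={episode}, average reward={reward_sums.sum() / horizon:.3f}")
```

The commit-per-episode design choice mirrors the general pattern of such algorithms: re-estimating and re-planning only at episode boundaries keeps the number of policy switches small, which is what the regret analysis controls.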