Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

Publication type:
Article
Author(s):
Scherrer, Bruno
Affiliation(s):
Inria
Journal:
MATHEMATICS OF OPERATIONS RESEARCH
ISSN:
0364-765X
DOI:
10.1287/moor.2015.0753
Publication date:
2016
Pages:
758-774
Keywords:
Abstract:
Given a Markov decision process (MDP) with n states and a total number m of actions, we study the number of iterations needed by policy iteration (PI) algorithms to converge to the optimal gamma-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most O((nm/(1-gamma)) log(1/(1-gamma))) iterations, improving by a factor O(log n) a result by Hansen et al. [Hansen TD, Miltersen PB, Zwick U (2013) Strategy iteration is strongly polynomial for two-player turn-based stochastic games with a constant discount factor. J. ACM 60(1):1:1-1:16.], whereas Simplex-PI terminates after at most O((n^2 m/(1-gamma)) log(1/(1-gamma))) iterations, improving by a factor O(log n) a result by Ye [Ye Y (2011) The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Math. Oper. Res. 36(4):593-603.]. Under some structural properties of the MDP, we then consider bounds that are independent of the discount factor gamma: the quantities of interest are bounds tau_t and tau_r, uniform over all states and policies, respectively on the expected time spent in transient states and on the inverse of the frequency of visits to recurrent states, given that the process starts from the uniform distribution. Indeed, we show that Simplex-PI terminates after at most Õ(n^2 m^2 tau_t tau_r) iterations. This extends a recent result for deterministic MDPs by Post and Ye [Post I, Ye Y (2013) The simplex method is strongly polynomial for deterministic Markov decision processes. Khanna S, ed. Proc. 24th ACM-SIAM Sympos. Discrete Algorithms, SODA '13 (SIAM, Philadelphia), 1465-1473.], in which tau_t <= 1 and tau_r <= n; in particular, it shows that Simplex-PI is strongly polynomial for a much larger class of MDPs. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned into two sets of states that are, respectively, transient and recurrent under all policies, we show that both Howard's PI and Simplex-PI terminate after at most Õ(m(n^2 tau_t + n tau_r)) iterations.
Source URL:
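
The following is a minimal, illustrative Python sketch of the two policy-iteration variants contrasted in the abstract, on a small tabular MDP. The representation (a transition tensor P of shape (n, m, n), a reward matrix R of shape (n, m)), the tolerance, and the function names evaluate and run_pi are assumptions made here for illustration; they are not taken from the paper.

```python
# Sketch (not from the paper): Howard's PI vs. Simplex-PI on a tabular MDP.
import numpy as np

def evaluate(P, R, gamma, policy):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi."""
    n = R.shape[0]
    P_pi = P[np.arange(n), policy]          # n x n transition matrix under pi
    r_pi = R[np.arange(n), policy]          # expected one-step reward under pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def run_pi(P, R, gamma, variant="howard", max_iter=10_000):
    """P: (n, m, n) transitions, R: (n, m) rewards; returns (policy, iterations)."""
    n, m, _ = P.shape
    policy = np.zeros(n, dtype=int)
    for it in range(max_iter):
        v = evaluate(P, R, gamma, policy)
        q = R + gamma * (P @ v)              # state-action values, shape (n, m)
        advantage = q.max(axis=1) - v        # positive where pi is improvable
        if np.all(advantage <= 1e-12):
            return policy, it                # pi is optimal (up to tolerance)
        if variant == "howard":
            # Howard's PI: switch every state that has a positive advantage.
            improvable = advantage > 1e-12
            policy[improvable] = q[improvable].argmax(axis=1)
        else:
            # Simplex-PI: switch only the single state with maximal advantage.
            s = int(advantage.argmax())
            policy[s] = int(q[s].argmax())
    return policy, max_iter
```

The only difference between the two variants is the update step: Howard's PI updates all improvable states at once, while Simplex-PI performs a single switch per iteration, which is what the iteration bounds quoted in the abstract count.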