Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

Publication type:
Article
Author(s):
Scherrer, Bruno
Affiliation(s):
Inria
Journal:
MATHEMATICS OF OPERATIONS RESEARCH
ISSN:
0364-765X
DOI:
10.1287/moor.2015.0753
Publication date:
2016
Pages:
758-774
Keywords:
Abstract:
Given a Markov decision process (MDP) with n states and a total number m of actions, we study the number of iterations needed by policy iteration (PI) algorithms to converge to the optimal gamma-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most O((nm/(1-gamma)) log(1/(1-gamma))) iterations, improving by a factor O(log n) a result by Hansen et al. [Hansen TD, Miltersen PB, Zwick U (2013) Strategy iteration is strongly polynomial for two-player turn-based stochastic games with a constant discount factor. J. ACM 60(1):1:1-1:16.], whereas Simplex-PI terminates after at most O((n^2 m/(1-gamma)) log(1/(1-gamma))) iterations, improving by a factor O(log n) a result by Ye [Ye Y (2011) The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Math. Oper. Res. 36(4):593-603.]. Under some structural properties of the MDP, we then consider bounds that are independent of the discount factor gamma: the quantities of interest are bounds tau_t and tau_r, uniform over all states and policies, respectively on the expected time spent in transient states and on the inverse of the frequency of visits to recurrent states, given that the process starts from the uniform distribution. Indeed, we show that Simplex-PI terminates after at most Õ(n^2 m^2 tau_t tau_r) iterations. This extends a recent result for deterministic MDPs by Post and Ye [Post I, Ye Y (2013) The simplex method is strongly polynomial for deterministic Markov decision processes. Khanna S, ed. Proc. 24th ACM-SIAM Sympos. Discrete Algorithms, SODA '13 (SIAM, Philadelphia), 1465-1473.], in which tau_t <= 1 and tau_r <= n; in particular, it shows that Simplex-PI is strongly polynomial for a much larger class of MDPs. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned into two sets of states that are, respectively, transient and recurrent under all policies, we show that both Howard's PI and Simplex-PI terminate after at most Õ(m(n^2 tau_t + n tau_r)) iterations.
Source URL:
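
The following is a minimal, illustrative Python sketch of the two policy-iteration variants contrasted in the abstract, on a small tabular MDP. The representation (a transition tensor P of shape (n, m, n), a reward matrix R of shape (n, m)), the tolerance, and the function names evaluate and run_pi are assumptions made here for illustration; they are not taken from the paper.

```python
# Sketch (not from the paper): Howard's PI vs. Simplex-PI on a tabular MDP.
import numpy as np

def evaluate(P, R, gamma, policy):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi."""
    n = R.shape[0]
    P_pi = P[np.arange(n), policy]          # n x n transition matrix under pi
    r_pi = R[np.arange(n), policy]          # expected one-step reward under pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def run_pi(P, R, gamma, variant="howard", max_iter=10_000):
    """P: (n, m, n) transitions, R: (n, m) rewards; returns (policy, iterations)."""
    n, m, _ = P.shape
    policy = np.zeros(n, dtype=int)
    for it in range(max_iter):
        v = evaluate(P, R, gamma, policy)
        q = R + gamma * (P @ v)              # state-action values, shape (n, m)
        advantage = q.max(axis=1) - v        # positive where pi is improvable
        if np.all(advantage <= 1e-12):
            return policy, it                # pi is optimal (up to tolerance)
        if variant == "howard":
            # Howard's PI: switch every state that has a positive advantage.
            improvable = advantage > 1e-12
            policy[improvable] = q[improvable].argmax(axis=1)
        else:
            # Simplex-PI: switch only the single state with maximal advantage.
            s = int(advantage.argmax())
            policy[s] = int(q[s].argmax())
    return policy, max_iter
```

The only difference between the two variants is the update step: Howard's PI updates all improvable states at once, while Simplex-PI performs a single switch per iteration, which is what the iteration bounds quoted in the abstract count.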