Upper Bounds for All and Max-Gain Policy Iteration Algorithms on Deterministic MDPs
Document type:
Article; Early Access
Authors:
Goenka, Ritesh; Gupta, Eashan; Khyalia, Sushil; Kalyanakrishnan, Shivaram
Author affiliations:
University of Oxford; University of Illinois System; University of Illinois Urbana-Champaign; Carnegie Mellon University; Indian Institute of Technology System (IIT System); Indian Institute of Technology (IIT) - Bombay
Journal:
MATHEMATICS OF OPERATIONS RESEARCH
ISSN/ISBN:
0364-765X
DOI:
10.1287/moor.2023.0317
Publication year:
2025
Keywords:
maximum number
nonnegative matrices
cycles
complexity
simplex
complement
digraphs
graphs
Abstract:
Policy iteration (PI) is a widely used family of algorithms to compute optimal policies for Markov decision problems (MDPs). Howard's [Howard RA (1960) Dynamic Programming and Markov Processes (MIT Press, Cambridge, MA)] PI is one of the most commonly used algorithms from this family. Despite its popularity, theoretical analysis of the running-time complexity of Howard's PI has remained elusive. For n-state, two-action MDPs, the best known lower and upper bounds are Ω(n) and O(2^n/n) iterations, respectively. Based on computational evidence for a combinatorial relaxation of this problem, Hansen [Hansen TD (2012) Worst-case analysis of strategy iteration and the simplex method. Unpublished PhD thesis, Aarhus University, Aarhus, Denmark] conjectured that the upper bound can be improved to O(φ^n), where φ = (1 + √5)/2 is the golden ratio. We prove this conjecture for deterministic MDPs (DMDPs), albeit up to a poly(n) factor. More generally, we derive a nontrivial upper bound for DMDPs that applies to the entire family of PI algorithms. We also derive an improved bound that applies to all max-gain switching variants. These bounds hold both under discounted and average reward settings. Combined with a result of Melekopoglou and Condon [Melekopoglou M, Condon A (1994) On the complexity of the policy improvement algorithm for Markov decision processes. ORSA J. Comput. 6(2):188-192], our results imply that stochasticity makes two-action MDPs harder to solve for PI. Our analysis is based on certain graph-theoretic results, which may be of independent interest.
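For readers unfamiliar with the algorithm family discussed in the abstract, the sketch below illustrates Howard's PI (the all-switch greedy variant) on a small n-state, two-action deterministic MDP under discounted reward. It is a minimal illustrative example, not the paper's construction; the array names `next_state` and `reward`, the discount value, and the tie-breaking tolerance are assumptions introduced here.

```python
# Minimal sketch of Howard's policy iteration on a deterministic MDP
# (illustrative only; array names and parameters are assumptions).
import numpy as np

def howard_pi(next_state, reward, gamma=0.9):
    """next_state[s, a]: successor of state s under action a (deterministic).
    reward[s, a]: immediate reward. Returns an optimal policy and its values."""
    n = next_state.shape[0]
    policy = np.zeros(n, dtype=int)  # start from an arbitrary policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P = np.zeros((n, n))
        P[np.arange(n), next_state[np.arange(n), policy]] = 1.0
        r = reward[np.arange(n), policy]
        V = np.linalg.solve(np.eye(n) - gamma * P, r)
        # Policy improvement: Howard's rule switches every improvable state at once.
        Q = reward + gamma * V[next_state]          # Q[s, a]
        best = Q.argmax(axis=1)
        improvable = Q[np.arange(n), best] > Q[np.arange(n), policy] + 1e-12
        if not improvable.any():
            return policy, V
        policy = np.where(improvable, best, policy)

# Tiny usage example: 3 states, 2 actions.
next_state = np.array([[1, 2], [2, 0], [0, 1]])
reward = np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 0.5]])
print(howard_pi(next_state, reward))
```

The max-gain switching variants mentioned in the abstract differ only in the improvement step: instead of switching all improvable states, they restrict switches based on the size of the improvement.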
Source URL: