Upper Bounds for All and Max-Gain Policy Iteration Algorithms on Deterministic MDPs
Document type:
Article; Early Access
Authors:
Goenka, Ritesh; Gupta, Eashan; Khyalia, Sushil; Kalyanakrishnan, Shivaram
Author affiliations:
University of Oxford; University of Illinois System; University of Illinois Urbana-Champaign; Carnegie Mellon University; Indian Institute of Technology System (IIT System); Indian Institute of Technology (IIT) - Bombay
Journal:
MATHEMATICS OF OPERATIONS RESEARCH
ISSN/ISBN:
0364-765X
DOI:
10.1287/moor.2023.0317
Publication year:
2025
Keywords:
maximum number
nonnegative matrices
cycles
complexity
simplex
complement
digraphs
graphs
Abstract:
Policy iteration (PI) is a widely used family of algorithms to compute optimal policies for Markov decision problems (MDPs). Howard's [Howard RA (1960) Dynamic Programming and Markov Processes (MIT Press, Cambridge, MA)] PI is one of the most commonly used algorithms from this family. Despite its popularity, theoretical analysis of the running-time complexity of Howard's PI has remained elusive. For n-state, two-action MDPs, the best known lower and upper bounds are Ω(n) and O(2^n/n) iterations, respectively. Based on computational evidence for a combinatorial relaxation of this problem, Hansen [Hansen TD (2012) Worst-case analysis of strategy iteration and the simplex method. Unpublished PhD thesis, Aarhus University, Aarhus, Denmark] conjectured that the upper bound can be improved to O(φ^n), where φ = (1 + √5)/2 is the golden ratio. We prove this conjecture for deterministic MDPs (DMDPs), albeit up to a poly(n) factor. More generally, we derive a nontrivial upper bound for DMDPs that applies to the entire family of PI algorithms. We also derive an improved bound that applies to all max-gain switching variants. These bounds hold both under discounted and average reward settings. Combined with a result of Melekopoglou and Condon [Melekopoglou M, Condon A (1994) On the complexity of the policy improvement algorithm for Markov decision processes. ORSA J. Comput. 6(2):188-192], our results imply that stochasticity makes two-action MDPs harder to solve for PI. Our analysis is based on certain graph-theoretic results, which may be of independent interest.
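For readers unfamiliar with the algorithm family discussed in the abstract, the sketch below illustrates Howard's PI (the all-switch greedy variant) on a small n-state, two-action deterministic MDP under discounted reward. It is a minimal illustrative example, not the paper's construction; the array names `next_state` and `reward`, the discount value, and the tie-breaking tolerance are assumptions introduced here.

```python
# Minimal sketch of Howard's policy iteration on a deterministic MDP
# (illustrative only; array names and parameters are assumptions).
import numpy as np

def howard_pi(next_state, reward, gamma=0.9):
    """next_state[s, a]: successor of state s under action a (deterministic).
    reward[s, a]: immediate reward. Returns an optimal policy and its values."""
    n = next_state.shape[0]
    policy = np.zeros(n, dtype=int)  # start from an arbitrary policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P = np.zeros((n, n))
        P[np.arange(n), next_state[np.arange(n), policy]] = 1.0
        r = reward[np.arange(n), policy]
        V = np.linalg.solve(np.eye(n) - gamma * P, r)
        # Policy improvement: Howard's rule switches every improvable state at once.
        Q = reward + gamma * V[next_state]          # Q[s, a]
        best = Q.argmax(axis=1)
        improvable = Q[np.arange(n), best] > Q[np.arange(n), policy] + 1e-12
        if not improvable.any():
            return policy, V
        policy = np.where(improvable, best, policy)

# Tiny usage example: 3 states, 2 actions.
next_state = np.array([[1, 2], [2, 0], [0, 1]])
reward = np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 0.5]])
print(howard_pi(next_state, reward))
```

The max-gain switching variants mentioned in the abstract differ only in the improvement step: instead of switching all improvable states, they restrict switches based on the size of the improvement.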
Source URL: