Homotopic policy mirror descent: policy convergence, algorithmic regularization, and improved sample complexity
Publication type:
Article
Authors:
Li, Yan; Lan, Guanghui; Zhao, Tuo
Affiliations:
University System of Georgia; Georgia Institute of Technology
Journal:
MATHEMATICAL PROGRAMMING
ISSN/ISBN:
0025-5610
DOI:
10.1007/s10107-023-02017-4
Publication date:
2024
Pages:
457-513
Keywords:
path
Abstract:
We study a new variant of the policy gradient method, named homotopic policy mirror descent (HPMD), for solving discounted, infinite-horizon MDPs with finite state and action spaces. HPMD performs a mirror descent type policy update with an additional diminishing regularization term, and possesses several computational properties that appear to be new in the literature. When instantiated with the Kullback-Leibler divergence, we establish the global linear convergence of HPMD applied to any MDP instance, for both the optimality gap and a weighted distance to the set of optimal policies. We then unveil a phase transition, where both quantities exhibit local acceleration and converge at a superlinear rate once the optimality gap falls below a certain instance-dependent threshold. Combining local acceleration with the diminishing regularization, we establish the first result among policy gradient methods on certifying and characterizing the limiting policy, by showing, with a non-asymptotic characterization, that the last-iterate policy converges to the unique optimal policy with maximal entropy. We then extend all of the aforementioned results to HPMD instantiated with a broad class of decomposable Bregman divergences, demonstrating the generality of these computational properties. As a byproduct, we discover finite-time exact convergence for some commonly used Bregman divergences, implying that HPMD continues to converge to the limiting policy even after the current policy becomes optimal. Finally, we develop a stochastic version of HPMD and establish similar convergence properties. By exploiting the local acceleration, we show that when a generative model is available for policy evaluation, for any sufficiently small $\epsilon_0$ and any target precision $\epsilon \le \epsilon_0$, an $\epsilon$-optimal policy can be learned with $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}|/\epsilon_0^2)$ samples with probability $1 - \mathcal{O}(\epsilon_0^{1/3})$.
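To make the abstract's "mirror descent type policy update with an additional diminishing regularization term" concrete, here is a minimal tabular sketch. It assumes a negative-entropy regularizer and the KL Bregman divergence, exact policy evaluation on a toy random MDP, and illustrative placeholder schedules for the step size eta_k and regularization weight tau_k; the function names and schedules are assumptions for illustration, not the paper's prescribed choices or analysis.

```python
import numpy as np

def policy_q(P, r, pi, gamma):
    """Exact tabular evaluation of Q^pi for a finite MDP.

    P: (S, A, S) transition tensor, r: (S, A) per-step costs, pi: (S, A) policy.
    """
    S, _ = r.shape
    P_pi = np.einsum('sap,sa->sp', P, pi)      # state transition matrix under pi
    r_pi = np.einsum('sa,sa->s', r, pi)        # expected one-step cost under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * np.einsum('sap,p->sa', P, v)

def kl_hpmd_step(pi, q, eta, tau):
    """One KL-instantiated HPMD-style update applied at every state.

    Per state s, solves
        min_{p in simplex}  eta*<q(s,.), p> + eta*tau*<p, log p> + KL(p || pi(s,.)),
    whose closed form is
        p(a) proportional to pi(s,a)^{1/(1+eta*tau)} * exp(-eta*q(s,a)/(1+eta*tau)).
    """
    logits = (np.log(pi) - eta * q) / (1.0 + eta * tau)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Toy random MDP, written in cost-minimization form to match the update above.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))                    # per-step costs
pi = np.full((S, A), 1.0 / A)             # start from the uniform policy

for k in range(200):
    q = policy_q(P, r, pi, gamma)
    eta_k = 1.0                           # illustrative step size (assumption)
    tau_k = gamma ** k                    # illustrative diminishing regularization (assumption)
    pi = kl_hpmd_step(pi, q, eta_k, tau_k)
```

The diminishing tau_k is what plays the homotopy role described in the abstract: each iterate targets a slightly less regularized problem, which is the mechanism the paper credits for driving the last iterate toward the maximal-entropy optimal policy.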