Homotopic policy mirror descent: policy convergence, algorithmic regularization, and improved sample complexity
Publication type:
Article
Authors:
Li, Yan; Lan, Guanghui; Zhao, Tuo
Affiliations:
University System of Georgia; Georgia Institute of Technology
Journal:
MATHEMATICAL PROGRAMMING
ISSN/ISBN:
0025-5610
DOI:
10.1007/s10107-023-02017-4
Publication date:
2024
Pages:
457-513
Keywords:
path
Abstract:
We study a new variant of the policy gradient method, named homotopic policy mirror descent (HPMD), for solving discounted, infinite-horizon MDPs with finite state and action spaces. HPMD performs a mirror descent type policy update with an additional diminishing regularization term, and possesses several computational properties that appear to be new in the literature. When instantiated with the Kullback-Leibler divergence, we establish the global linear convergence of HPMD applied to any MDP instance, for both the optimality gap and a weighted distance to the set of optimal policies. We then unveil a phase transition, where both quantities exhibit local acceleration and converge at a superlinear rate once the optimality gap falls below a certain instance-dependent threshold. Combining local acceleration with the diminishing regularization, we establish the first result among policy gradient methods on certifying and characterizing the limiting policy, by showing, with a non-asymptotic characterization, that the last-iterate policy converges to the unique optimal policy with maximal entropy. We then extend all of the aforementioned results to HPMD instantiated with a broad class of decomposable Bregman divergences, demonstrating the generality of these computational properties. As a byproduct, we discover finite-time exact convergence for some commonly used Bregman divergences, implying that HPMD continues to converge to the limiting policy even after the current policy becomes optimal. Finally, we develop a stochastic version of HPMD and establish similar convergence properties. By exploiting the local acceleration, we show that when a generative model is available for policy evaluation, for any sufficiently small $\epsilon_0$ and any target precision $\epsilon \le \epsilon_0$, an $\epsilon$-optimal policy can be learned with $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}|/\epsilon_0^2)$ samples with probability $1 - \mathcal{O}(\epsilon_0^{1/3})$.
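To make the abstract's "mirror descent type policy update with an additional diminishing regularization term" concrete, here is a minimal tabular sketch. It assumes a negative-entropy regularizer and the KL Bregman divergence, exact policy evaluation on a toy random MDP, and illustrative placeholder schedules for the step size eta_k and regularization weight tau_k; the function names and schedules are assumptions for illustration, not the paper's prescribed choices or analysis.

```python
import numpy as np

def policy_q(P, r, pi, gamma):
    """Exact tabular evaluation of Q^pi for a finite MDP.

    P: (S, A, S) transition tensor, r: (S, A) per-step costs, pi: (S, A) policy.
    """
    S, _ = r.shape
    P_pi = np.einsum('sap,sa->sp', P, pi)      # state transition matrix under pi
    r_pi = np.einsum('sa,sa->s', r, pi)        # expected one-step cost under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * np.einsum('sap,p->sa', P, v)

def kl_hpmd_step(pi, q, eta, tau):
    """One KL-instantiated HPMD-style update applied at every state.

    Per state s, solves
        min_{p in simplex}  eta*<q(s,.), p> + eta*tau*<p, log p> + KL(p || pi(s,.)),
    whose closed form is
        p(a) proportional to pi(s,a)^{1/(1+eta*tau)} * exp(-eta*q(s,a)/(1+eta*tau)).
    """
    logits = (np.log(pi) - eta * q) / (1.0 + eta * tau)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Toy random MDP, written in cost-minimization form to match the update above.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))                    # per-step costs
pi = np.full((S, A), 1.0 / A)             # start from the uniform policy

for k in range(200):
    q = policy_q(P, r, pi, gamma)
    eta_k = 1.0                           # illustrative step size (assumption)
    tau_k = gamma ** k                    # illustrative diminishing regularization (assumption)
    pi = kl_hpmd_step(pi, q, eta_k, tau_k)
```

The diminishing tau_k is what plays the homotopy role described in the abstract: each iterate targets a slightly less regularized problem, which is the mechanism the paper credits for driving the last iterate toward the maximal-entropy optimal policy.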