
Safe Policies for Reinforcement Learning via Primal-Dual Methods

Publication type:

Article

Authors:

Paternain, Santiago; Calvo-Fullana, Miguel; Chamon, Luiz F. O.; Ribeiro, Alejandro

Affiliations:

Massachusetts Institute of Technology (MIT); University of California System; University of California Berkeley; University of Pennsylvania

Journal:

IEEE TRANSACTIONS ON AUTOMATIC CONTROL

ISSN/ISBN:

0018-9286

DOI:

10.1109/TAC.2022.3152724

Publication date:

2023

Pages:

1321-1336

Keywords:

Safety; Trajectory; Reinforcement learning; Task analysis; Optimal control; Optimization; Markov processes; Autonomous systems; Gradient methods; Unsupervised learning

Abstract:

In this article, we study the design of controllers in the context of stochastic optimal control under the assumption that the model of the system is not available. That is, we aim to control a Markov decision process of which we do not know the transition probabilities, but we have access to sample trajectories through experience. We define safety as the agent remaining in a desired safe set with high probability during the operation time. The drawbacks of this formulation are twofold. The problem is nonconvex, and computing the gradients of the constraints with respect to the policies is prohibitive. Hence, we propose an ergodic relaxation of the constraints with the following advantages. 1) The safety guarantees are maintained in the case of episodic tasks and they hold until a given time horizon for continuing tasks. 2) The constrained optimization problem, despite its nonconvexity, has an arbitrarily small duality gap if the parametrization of the controller is rich enough. 3) The gradients of the Lagrangian associated with the safe learning problem can be computed using standard reinforcement learning results and stochastic approximation tools. Leveraging these advantages, we exploit primal-dual algorithms to find policies that are safe and optimal. We test the proposed approach in a navigation task in a continuous domain. The numerical results show that our algorithm is capable of dynamically adapting the policy to the environment and the required safety levels.
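
The primal-dual idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a deliberately tiny, hypothetical one-parameter problem in which exact gradients stand in for the sampled policy-gradient estimates the abstract refers to. The reward V(theta) = -(theta - 2)^2, the safety level S(theta) = 1 - 0.1 * theta, the required level 0.9, the step sizes, and the function name primal_dual are all assumptions made for illustration. The sketch ascends the Lagrangian L(theta, lam) = V(theta) + lam * (S(theta) - 0.9) in theta and descends it in the multiplier lam, projecting lam onto the nonnegative reals.

```python
def primal_dual(steps=2000, eta_theta=0.1, eta_lam=10.0):
    """Toy primal-dual iteration on a hypothetical one-parameter problem.

    Maximize V(theta) = -(theta - 2)^2 subject to S(theta) >= 0.9,
    where S(theta) = 1 - 0.1 * theta is an assumed "safety level".
    """
    required_safety = 0.9   # hypothetical safety requirement
    theta, lam = 0.0, 0.0   # policy parameter and Lagrange multiplier

    for _ in range(steps):
        safety = 1.0 - 0.1 * theta

        # Gradient of L = V + lam * (S - required_safety) w.r.t. theta.
        # Exact here; a learning agent would estimate it from trajectories.
        grad_theta = -2.0 * (theta - 2.0) + lam * (-0.1)

        # Primal ascent step on the parameter (uses the current multiplier).
        theta_next = theta + eta_theta * grad_theta

        # Dual descent step: the multiplier grows while the constraint is
        # violated and is projected back to zero otherwise.
        lam = max(0.0, lam - eta_lam * (safety - required_safety))

        theta = theta_next

    return theta, lam


if __name__ == "__main__":
    theta, lam = primal_dual()
    print(f"theta = {theta:.4f}, lambda = {lam:.4f}")
```

In this toy problem the constraint is active at the optimum, so the stationarity conditions give theta = 1 and lam = 20, which is where the iterates settle. The multiplier acts as an automatically tuned penalty weight on the safety requirement, which is one way to read the abstract's remark about adapting the policy to the required safety levels.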
