Learning Optimal Controllers for Linear Systems With Multiplicative Noise via Policy Gradient
Publication Type:
Article
Authors:
Gravell, Benjamin; Esfahani, Peyman Mohajerin; Summers, Tyler
Affiliations:
University of Texas System; University of Texas Dallas; Delft University of Technology
Journal:
IEEE TRANSACTIONS ON AUTOMATIC CONTROL
ISSN/ISBN:
0018-9286
DOI:
10.1109/TAC.2020.3037046
Publication Date:
2021
Pages:
5283-5298
Keywords:
robustness
stability analysis
convergence
uncertainty
covariance matrices
additive noise
stochastic processes
gradient methods
noise
optimal control
reinforcement learning
stochastic systems
uncertain systems
Abstract:
The linear quadratic regulator (LQR) problem has reemerged as an important theoretical benchmark for reinforcement-learning-based control of complex dynamical systems with continuous state and action spaces. In contrast with nearly all recent work in this area, we consider multiplicative noise models, which are increasingly relevant because they explicitly incorporate inherent uncertainty and variation in the system dynamics and thereby improve robustness properties of the controller. Robustness is a critical and poorly understood issue in reinforcement learning; existing methods that do not account for uncertainty can converge to fragile policies or fail to converge at all. Additionally, intentional injection of multiplicative noise into learning algorithms can enhance the robustness of policies, as observed in ad hoc work on domain randomization. Although policy gradient algorithms require optimization of a nonconvex cost function, we show that the multiplicative-noise LQR cost has a special property called gradient domination, which is exploited to prove global convergence of policy gradient algorithms to the globally optimal control policy with polynomial dependence on problem parameters. Results are provided in both the model-known and model-unknown settings; in the latter, samples of system trajectories are used to estimate the policy gradients.
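
Illustrative sketch (not from the paper): the Python/NumPy snippet below shows the general kind of iteration the abstract describes, a linear policy u = -Kx applied to an LQR system with multiplicative noise on the state and input matrices, with the policy gradient estimated from sampled trajectories via a two-point zeroth-order scheme. The system matrices, noise variances, horizon, smoothing radius, and step size are placeholder assumptions chosen only to make the example runnable.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder 2-state / 1-input system with one multiplicative noise term
# on each of the state and input matrices (all values are illustrative).
A  = np.array([[1.0, 0.1], [0.0, 1.0]])
B  = np.array([[0.0], [0.1]])
A1 = np.array([[0.0, 0.1], [0.0, 0.0]])   # direction of state-matrix variation
B1 = np.array([[0.0], [0.1]])             # direction of input-matrix variation
alpha, beta = 0.1, 0.1                    # multiplicative noise variances (assumed)
Q, R = np.eye(2), np.eye(1)

def rollout_cost(K, T=50):
    """Average quadratic cost of the policy u = -K x on one noisy trajectory."""
    x = rng.standard_normal(2)
    cost = 0.0
    for _ in range(T):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        # Dynamics with zero-mean multiplicative perturbations of A and B.
        At = A + np.sqrt(alpha) * rng.standard_normal() * A1
        Bt = B + np.sqrt(beta) * rng.standard_normal() * B1
        x = At @ x + Bt @ u
    return cost / T

def zeroth_order_gradient(K, radius=0.05, n_samples=100):
    """Two-point zeroth-order policy-gradient estimate (model-unknown setting)."""
    d = K.size
    grad = np.zeros_like(K)
    for _ in range(n_samples):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)                          # random unit direction
        diff = rollout_cost(K + radius * U) - rollout_cost(K - radius * U)
        grad += d * (diff / (2 * radius)) * U
    return grad / n_samples

# Plain policy-gradient descent on the feedback gain K.
K = np.zeros((1, 2))
for _ in range(200):
    K -= 1e-3 * zeroth_order_gradient(K)
print("final gain:", K, "cost estimate:", rollout_cost(K, T=200))

In the model-known setting the gradient estimate above would be replaced by the exact gradient computed from the system matrices; gradient domination of the multiplicative-noise LQR cost is what allows such descent iterations to reach the global optimum despite the nonconvexity of the cost in K.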