
Expedited Online Learning With Spatial Side Information

Publication type:

Article

Authors:

Thangeda, Pranay; Ornik, Melkior; Topcu, Ufuk

Affiliations:

University of Illinois System; University of Illinois Urbana-Champaign; University of Texas System; University of Texas Austin

Journal:

IEEE Transactions on Automatic Control

ISSN/ISBN:

0018-9286

DOI:

10.1109/TAC.2022.3153278

Publication date:

2023

Pages:

1479-1491

Keywords:

Heuristic algorithms; safety; vehicle dynamics; Bayes methods; aerodynamics; optimal control; Markov processes; Markov decision processes (MDPs); online learning; planning; side information

Abstract:

The applicability of model-based online reinforcement learning algorithms is often limited by the amount of exploration required for learning the environment model to the desired level of accuracy. A promising approach to addressing this issue is to exploit side information, available either a priori or during the agent's mission, for learning the unknown dynamics. Side information in our context refers to information in the form of bounds on the differences between transition probabilities at different states in the environment. We use this information as a measure of reusability of the direct experience gained by performing actions and observing the outcomes at different states. We propose a framework to integrate side information into existing model-based reinforcement learning algorithms by complementing the samples obtained directly at states with second-hand information obtained from other states with similar dynamics. Additionally, we propose an algorithm for synthesizing the optimal control strategy in unknown environments by using side information to effectively balance between exploration and exploitation. We prove that, with high probability, the proposed algorithm yields a near-optimal policy in the Bayesian sense, while also guaranteeing the safety of the agent during exploration. We obtain the near-optimal policy in time steps that are polynomial in terms of the parameters describing the model. We illustrate the utility of the proposed algorithms in a setting of a Mars rover, with data from onboard sensors and a companion aerial vehicle acting as the side information.
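
The idea of reusing experience across states can be illustrated with a small sketch. It is not the algorithm from the paper. It assumes a single action with a binary outcome, a Hoeffding confidence interval for each state's empirical frequency, and side information given as a hypothetical dictionary `eps` whose entry `eps[(s, t)]` bounds the difference between the outcome probabilities at states `s` and `t`. The function names and data layout are illustrative assumptions. Each interval holds with probability at least `1 - delta` on its own, so the intersection holds with a lower probability given by a union bound over the states used.

```python
import math


def hoeffding_radius(n, delta):
    """Half-width of a two-sided Hoeffding interval for a Bernoulli mean."""
    if n == 0:
        return 1.0  # no samples: the interval covers all of [0, 1]
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def interval_with_side_info(state, counts, successes, eps, delta):
    """Confidence interval for the outcome probability at `state`.

    counts[s]    : number of times the action was tried at state s
    successes[s] : number of times the outcome of interest was observed at s
    eps[(state, s)] : assumed bound on |p[state] - p[s]| (side information)

    Samples from another state s are reused by widening that state's
    interval by eps[(state, s)]; all usable intervals are intersected.
    """
    lo, hi = 0.0, 1.0
    for s, n in counts.items():
        if n == 0:
            continue
        if s == state:
            slack = 0.0  # direct experience needs no widening
        elif (state, s) in eps:
            slack = eps[(state, s)]
        else:
            continue  # no side information relating the two states
        mean = successes[s] / n
        radius = hoeffding_radius(n, delta) + slack
        lo = max(lo, mean - radius)
        hi = min(hi, mean + radius)
    return lo, hi


# State "A" has been visited rarely, state "B" often; the side
# information says their outcome probabilities differ by at most 0.05.
counts = {"A": 5, "B": 500}
successes = {"A": 4, "B": 400}
eps = {("A", "B"): 0.05}

direct_only = interval_with_side_info("A", counts, successes, {}, delta=0.05)
with_side = interval_with_side_info("A", counts, successes, eps, delta=0.05)
print(direct_only, with_side)
```

In this example the interval for the rarely visited state narrows considerably once the well-explored state's samples are reused, which is the sense in which side information can reduce the exploration needed at each state. If the assumed bounds are wrong, the intersection can be empty, so a real implementation would need to handle that case.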