Expedited Online Learning With Spatial Side Information

Document type:
Article
Authors:
Thangeda, Pranay; Ornik, Melkior; Topcu, Ufuk
Affiliations:
University of Illinois System; University of Illinois Urbana-Champaign; University of Texas System; University of Texas Austin
Journal:
IEEE TRANSACTIONS ON AUTOMATIC CONTROL
ISSN/ISBN:
0018-9286
DOI:
10.1109/TAC.2022.3153278
Publication date:
2023
Pages:
1479-1491
Keywords:
Heuristic algorithms; safety; vehicle dynamics; Bayes methods; aerodynamics; optimal control; Markov processes; Markov decision processes (MDPs); online learning; planning; side information
Abstract:
The applicability of model-based online reinforcement learning algorithms is often limited by the amount of exploration required for learning the environment model to the desired level of accuracy. A promising approach to addressing this issue is to exploit side information, available either a priori or during the agent's mission, for learning the unknown dynamics. Side information in our context refers to information in the form of bounds on the differences between transition probabilities at different states in the environment. We use this information as a measure of reusability of the direct experience gained by performing actions and observing the outcomes at different states. We propose a framework to integrate side information into existing model-based reinforcement learning algorithms by complementing the samples obtained directly at states with second-hand information obtained from other states with similar dynamics. Additionally, we propose an algorithm for synthesizing the optimal control strategy in unknown environments by using side information to effectively balance between exploration and exploitation. We prove that, with high probability, the proposed algorithm yields a near-optimal policy in the Bayesian sense, while also guaranteeing the safety of the agent during exploration. We obtain the near-optimal policy in time steps that are polynomial in terms of the parameters describing the model. We illustrate the utility of the proposed algorithms in a setting of a Mars rover, with data from onboard sensors and a companion aerial vehicle acting as the side information.
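
To make the idea of reusing experience across states concrete, the following is a minimal sketch, not the algorithm from the paper. It assumes a single binary outcome per state (for example, "the move succeeds"), a Hoeffding confidence interval, and a hypothetical dictionary `eps` whose entry for a pair of states bounds the absolute difference between their success probabilities. The function names and the data layout are illustrative assumptions.

```python
import math


def hoeffding_radius(n, delta):
    """Half-width of a two-sided Hoeffding interval for a Bernoulli mean."""
    if n == 0:
        return 1.0
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def interval_with_side_info(state, successes, trials, eps, delta=0.05):
    """Confidence interval for the success probability at `state`.

    successes[s], trials[s]: outcome counts observed at state s.
    eps[(s, t)]: assumed side-information bound on |p(s) - p(t)|.
    Each contributing interval holds with probability 1 - delta, so by a
    union bound the intersection of k intervals holds with probability
    at least 1 - k * delta.
    """
    lo, hi = 0.0, 1.0
    for other, n in trials.items():
        if n == 0:
            continue  # no samples at this state, nothing to reuse
        if other == state:
            gap = 0.0  # direct experience needs no widening
        else:
            gap = eps.get((state, other), eps.get((other, state)))
            if gap is None:
                continue  # no side information linking the two states
        mean = successes[other] / n
        # Second-hand samples are widened by the side-information bound.
        radius = hoeffding_radius(n, delta) + gap
        lo = max(lo, mean - radius)
        hi = min(hi, mean + radius)
    return lo, hi


# State "a" has never been visited; state "b" has 1000 samples and is
# assumed to differ from "a" by at most 0.05 in success probability.
trials = {"a": 0, "b": 1000}
successes = {"a": 0, "b": 800}
print(interval_with_side_info("a", successes, trials, {("a", "b"): 0.05}))
print(interval_with_side_info("a", successes, trials, {}))
```

In this sketch, the unvisited state inherits an informative interval from its neighbor, while without side information its interval stays at the trivial [0, 1]. A tighter side-information bound makes borrowed samples nearly as useful as direct ones, which is the sense in which the bound measures reusability of experience.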