Approximation Benefits of Policy Gradient Methods with Aggregated States
Publication type:
Article
Author(s):
Russo, Daniel
Affiliation:
Columbia University
Journal:
MANAGEMENT SCIENCE
ISSN/ISBN:
0025-1909
DOI:
10.1287/mnsc.2023.4788
Publication date:
2023
Pages:
6898-6911
Keywords:
reinforcement learning
approximate dynamic programming
policy gradient methods
state aggregation
Abstract:
Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregated representations, in which the state space is partitioned and either the policy or the value function approximation is held constant over partitions. It shows that a policy gradient method converges to a policy whose per-period regret is bounded by epsilon, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as epsilon/(1-gamma), where gamma is the discount factor. Faced with inherent approximation error, methods that locally optimize the true decision objective can be far more robust.
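For intuition about the setting described in the abstract, below is a minimal, hypothetical sketch (not code from the paper) of policy gradient with a state-aggregated softmax policy on a small synthetic MDP. The transition kernel P, reward matrix R, partition map phi, and the exact-gradient routine are illustrative assumptions: gradients are computed exactly from the policy gradient theorem rather than estimated from samples, and all states in a partition share the same action logits.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Small synthetic MDP (hypothetical example, not from the paper) ---
S, A, gamma = 12, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
R = rng.uniform(0.0, 1.0, size=(S, A))       # rewards r(s, a)
rho = np.full(S, 1.0 / S)                    # initial-state distribution

# --- State aggregation: partition the 12 states into 4 clusters ---
K = 4
phi = np.repeat(np.arange(K), S // K)        # phi[s] = index of the partition containing s

def policy(theta):
    """Softmax policy held constant over each partition (state-aggregated)."""
    logits = theta[phi]                      # (S, A): states in a partition share logits
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def q_values(pi):
    """Exact Q^pi and V^pi by solving the Bellman equations (small MDP, direct solve)."""
    P_pi = np.einsum('sax,sa->sx', P, pi)    # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * P @ v, v

def occupancy(pi):
    """Normalized discounted state-occupancy measure d^pi under initial distribution rho."""
    P_pi = np.einsum('sax,sa->sx', P, pi)
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    return (1.0 - gamma) * d

# --- Exact policy gradient ascent on the aggregated parameters ---
theta = np.zeros((K, A))
for _ in range(5000):
    pi = policy(theta)
    Q, V = q_values(pi)
    d = occupancy(pi)
    adv = Q - V[:, None]                     # advantage A^pi(s, a)
    grad_s = d[:, None] * pi * adv           # per-state gradient of the objective w.r.t. logits
    grad = np.zeros_like(theta)
    np.add.at(grad, phi, grad_s)             # sum gradients of states within each partition
    theta += 1.0 * grad                      # step size chosen for this small example

pi = policy(theta)
_, V = q_values(pi)
print("J(pi) =", rho @ V)
```

The key point the sketch illustrates is that the optimization target is the true discounted return of the aggregated policy; the aggregation only restricts which policies are representable, so the loss from approximation is governed by how much the state-action values vary within a partition.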