Self-Guided Approximate Linear Programs: Randomized Multi-Shot Approximation of Discounted Cost Markov Decision Processes
成果类型:
Article
署名作者:
Pakiman, Parshan; Nadarajah, Selvaprabu; Soheili, Negar; Lin, Qihang
署名单位:
University of Illinois System; University of Illinois Chicago; University of Illinois Chicago Hospital; University of Iowa
刊物名称:
MANAGEMENT SCIENCE
ISSN/ISSBN:
0025-1909
DOI:
10.1287/mnsc.2020.00038
发表日期:
2025
关键词:
approximate linear programming
random features
Markov decision processes
Approximate Dynamic Programming
Reinforcement Learning
INVENTORY CONTROL
options pricing
摘要:
Approximate linear programs (ALPs) are well-known models based on value function approximations (VFAs) to obtain policies and lower bounds on the optimal policy cost of discounted-cost Markov decision processes (MDPs). Formulating an ALP requires (i) basis functions, the linear combination of which defines the VFA, and (ii) a state relevance distribution, which determines the relative importance of different states in the ALP objective for the purpose of minimizing VFA error. Both of these choices are typically heuristic; basis function selection relies on domain knowledge, whereas the state-relevance distribution is specified using the frequency of states visited by a baseline policy. We propose a self-guided sequence of ALPs that embeds random basis functions obtained via inexpensive sampling and uses the known VFA from the previous iteration to guide VFA computation in the current iteration. In other words, this sequence takes multiple shots randomly approximating the MDP value function with VFA-based guidance between consecutive approximation attempts. Self-guided ALPs mitigate domain knowledge during basis function selection and the impact of the state-relevance-distribution choice, thus reducing the ALP implementation burden. We establish high-probability error bounds on the VFAs from this sequence and show that a worst-case measure of policy performance improved. We find that these favorable implementation and theoretical properties translate to encouraging numerical results on perishable inventory control and options pricing applications, where self-guided ALP policies improve upon policies from problem-specific methods. More broadly, our research takes a meaningful step toward application-agnostic policies and bounds for MDPs.