Don't Follow Reinforcement Learning Blindly: Lower Sample Complexity of Learning Optimal Inventory Control Policies with Fixed Ordering Cost
Publication Type:
Article; Early Access
Authors:
Fan, Xiaoyu; Chen, Boxiao; Olsen, Tava; Qin, Hanzhang; Zhou, Zhengyuan
Affiliations:
New York University; University of Illinois System; University of Illinois Chicago; University of Illinois Chicago Hospital; University of Melbourne; National University of Singapore
Journal:
PRODUCTION AND OPERATIONS MANAGEMENT
ISSN/ISBN:
1059-1478
DOI:
10.1177/10591478251378851
Publication Date:
2025
Keywords:
Offline Learning
Sample Complexity
Inventory Control
Fixed Ordering Cost
Abstract:
We study the sample complexity of offline learning for a class of structured Markov decision processes (MDPs) describing inventory control systems with a fixed ordering (setup) cost, a fundamental problem in supply chains. We find that a naive plug-in sampling-based approach applied to these inventory MDPs achieves strictly lower sample complexity bounds than the optimal bounds recently obtained for general MDPs. More specifically, in the infinite-horizon discounted cost setting, we obtain an Õ(min{(S̄−s̲)²/((1−γ)²ε²), 1/((1−γ)⁴ε²)}) sample complexity bound, where (S̄−s̲)² plays the role of the number of state-action pairs |S||A| in a generic MDP with state space S and action space A. As such, Õ((S̄−s̲)²/((1−γ)²ε²)) improves on the optimal generic reinforcement learning (RL) bound Θ̃((S̄−s̲)²/((1−γ)³ε²)) (obtained by directly applying Θ̃(|S||A|/((1−γ)³ε²)) here) by a factor of (1−γ)⁻¹, and Õ(1/((1−γ)⁴ε²)) removes the dependence on the state and action cardinalities entirely. In the infinite-horizon average cost setting, we obtain an Õ((S̄−s̲)²/ε²) bound, improving on the generic optimal RL bound Θ̃((S̄−s̲)² t_mix/ε²) (obtained by directly applying Θ̃(|S||A| t_mix/ε²) here) by a factor of t_mix, thereby removing the dependence on the mixing time. By carefully leveraging the structural properties of the inventory dynamics in the various settings, we improve on the best-possible bounds developed in the RL literature. Our results demonstrate the drawbacks one can face by blindly following generic RL algorithms and the necessity of designing sample-efficient algorithms that properly exploit the special structure of inventory systems.
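To make the min in the discounted-cost bound concrete, the sketch below (not from the paper; the function name and numeric values are our own, and constants plus log factors hidden by the Õ notation are ignored) evaluates the two terms: the state-range-dependent term dominates the minimum when the inventory range S̄−s̲ is small, while the cardinality-free term takes over when the range is large.

```python
def discounted_bound(state_range, gamma, eps):
    """Illustrative only: min of the two terms in the paper's
    O-tilde(min{(S_bar - s_underbar)^2 / ((1-gamma)^2 eps^2),
                1 / ((1-gamma)^4 eps^2)}) bound,
    with constants and log factors suppressed."""
    # Term that scales with the squared inventory range (S_bar - s_underbar)^2
    term_range = state_range ** 2 / ((1 - gamma) ** 2 * eps ** 2)
    # Term independent of state/action cardinality, paying (1-gamma)^-4 instead
    term_free = 1 / ((1 - gamma) ** 4 * eps ** 2)
    return min(term_range, term_free)


# Hypothetical numbers: with gamma = 0.9 and eps = 0.1, a small range of 5
# is governed by the range-dependent term, while a range of 100 is governed
# by the cardinality-free term.
small = discounted_bound(5, 0.9, 0.1)
large = discounted_bound(100, 0.9, 0.1)
```

The crossover happens where (S̄−s̲)² exceeds 1/(1−γ)², i.e. when the inventory range exceeds 1/(1−γ), which is why neither term alone can be uniformly better.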
Source URL: