Don't Follow Reinforcement Learning Blindly: Lower Sample Complexity of Learning Optimal Inventory Control Policies with Fixed Ordering Cost
Publication Type:
Article; Early Access
Authors:
Fan, Xiaoyu; Chen, Boxiao; Olsen, Tava; Qin, Hanzhang; Zhou, Zhengyuan
Affiliations:
New York University; University of Illinois System; University of Illinois Chicago; University of Illinois Chicago Hospital; University of Melbourne; National University of Singapore
Journal:
PRODUCTION AND OPERATIONS MANAGEMENT
ISSN/ISBN:
1059-1478
DOI:
10.1177/10591478251378851
Publication Date:
2025
Keywords:
Offline Learning
Sample Complexity
Inventory Control
Fixed Ordering Cost
Abstract:
We study the sample complexity of offline learning for a class of structured Markov decision processes (MDPs) describing inventory control systems with a fixed ordering (setup) cost, a fundamental problem in supply chains. We find that a naive plug-in sampling-based approach applied to these inventory MDPs achieves strictly lower sample complexity bounds than the optimal bounds recently obtained for general MDPs. More specifically, in the infinite-horizon discounted cost setting, we obtain an Õ(min{(S̄−s̲)²/((1−γ)²ε²), 1/((1−γ)⁴ε²)}) sample complexity bound, where (S̄−s̲)² plays the role of the number of state-action pairs |S||A| in a generic MDP with state space S and action space A. As such, Õ((S̄−s̲)²/((1−γ)²ε²)) improves on the optimal generic reinforcement learning (RL) bound Θ̃((S̄−s̲)²/((1−γ)³ε²)) (obtained by directly applying Θ̃(|S||A|/((1−γ)³ε²)) here) by a factor of (1−γ)⁻¹, and Õ(1/((1−γ)⁴ε²)) removes the dependence on the state and action cardinalities entirely. In the infinite-horizon average cost setting, we obtain an Õ((S̄−s̲)²/ε²) bound, improving on the generic optimal RL bound Θ̃((S̄−s̲)² t_mix/ε²) (obtained by directly applying Θ̃(|S||A| t_mix/ε²) here) by a factor of t_mix, thereby removing the dependence on the mixing time. By carefully leveraging the structural properties of the inventory dynamics in the various settings, we improve on the best-possible bounds developed in the RL literature. Our results demonstrate the drawbacks one can face by blindly following generic RL algorithms and the necessity of designing sample-efficient algorithms that properly exploit the special structure of inventory systems.
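To make the min in the discounted-cost bound concrete, the sketch below (not from the paper; the function name and numeric values are our own, and constants plus log factors hidden by the Õ notation are ignored) evaluates the two terms: the state-range-dependent term dominates the minimum when the inventory range S̄−s̲ is small, while the cardinality-free term takes over when the range is large.

```python
def discounted_bound(state_range, gamma, eps):
    """Illustrative only: min of the two terms in the paper's
    O-tilde(min{(S_bar - s_underbar)^2 / ((1-gamma)^2 eps^2),
                1 / ((1-gamma)^4 eps^2)}) bound,
    with constants and log factors suppressed."""
    # Term that scales with the squared inventory range (S_bar - s_underbar)^2
    term_range = state_range ** 2 / ((1 - gamma) ** 2 * eps ** 2)
    # Term independent of state/action cardinality, paying (1-gamma)^-4 instead
    term_free = 1 / ((1 - gamma) ** 4 * eps ** 2)
    return min(term_range, term_free)


# Hypothetical numbers: with gamma = 0.9 and eps = 0.1, a small range of 5
# is governed by the range-dependent term, while a range of 100 is governed
# by the cardinality-free term.
small = discounted_bound(5, 0.9, 0.1)
large = discounted_bound(100, 0.9, 0.1)
```

The crossover happens where (S̄−s̲)² exceeds 1/(1−γ)², i.e. when the inventory range exceeds 1/(1−γ), which is why neither term alone can be uniformly better.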
Source URL: