Learning in Structured MDPs with Convex Cost Functions: Improved Regret Bounds for Inventory Management

Publication Type:
Article; Early Access
Authors:
Agrawal, Shipra; Jia, Randy
Affiliation:
Columbia University
Journal:
OPERATIONS RESEARCH
ISSN/ISBN:
0030-364X
DOI:
10.1287/opre.2022.2263
Publication Date:
2022
Keywords:
Abstract:
We consider a stochastic inventory control problem under censored demand, lost sales, and positive lead times. This is a fundamental problem in inventory management, with significant literature establishing the near optimality of a simple class of policies called base-stock policies, as well as the convexity of the long-run average cost under those policies. We consider the relatively less studied problem of designing a learning algorithm for this setting when the underlying demand distribution is unknown. The goal is to bound the regret of the algorithm when compared with the best base-stock policy. Our main contribution is a learning algorithm with a regret bound of Õ((L + 1)√T + D) for the inventory control problem. Here, L ≥ 0 is the fixed and known lead time, and D is an unknown parameter of the demand distribution, described roughly as the expected number of time steps needed to generate enough demand to deplete one unit of inventory. Notably, our regret bound depends linearly on L, which significantly improves the previously best-known regret bounds for this problem, where the dependence on L was exponential. Our techniques utilize the convexity of the long-run average cost and a newly derived bound on the bias of base-stock policies to establish an almost black-box connection between the problem of learning in Markov decision processes (MDPs) with these properties and the stochastic convex bandit problem. The techniques presented here may be of independent interest for other settings that involve large structured MDPs but with convex asymptotic average cost functions.
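To make the setting in the abstract concrete, the following is a minimal sketch (not the paper's algorithm) of the dynamics under a base-stock policy with lead time L, lost sales, and censored demand: each period the order placed L periods ago arrives, sales are censored at the on-hand stock, unmet demand is lost, and the policy orders up to a target level S. The cost parameters and the function name are illustrative assumptions.

```python
import random

def simulate_base_stock(S, L, demand_sampler, T,
                        holding_cost=1.0, lost_sales_penalty=4.0, seed=0):
    """Average per-period cost of an order-up-to-S policy over T periods.

    Illustrative sketch: demand is censored (we only see sales =
    min(demand, on-hand)), unmet demand is lost, and orders arrive
    after a lead time of L periods.
    """
    rng = random.Random(seed)
    on_hand = S                 # start with full on-hand stock
    pipeline = [0] * L          # orders placed in the last L periods
    total_cost = 0.0
    for _ in range(T):
        # The order placed L periods ago arrives.
        if L > 0:
            on_hand += pipeline.pop(0)
        demand = demand_sampler(rng)
        sales = min(demand, on_hand)   # censored demand: only sales observed
        lost = demand - sales          # lost sales, not backlogged
        on_hand -= sales
        total_cost += holding_cost * on_hand + lost_sales_penalty * lost
        # Base-stock rule: bring inventory position (on-hand + pipeline) up to S.
        order = max(0, S - on_hand - sum(pipeline))
        if L > 0:
            pipeline.append(order)
        else:
            on_hand += order
    return total_cost / T
```

Evaluating this simulator over a grid of S values illustrates the property the paper exploits: the long-run average cost is convex in the base-stock level S, which is what lets the learning problem be reduced to a stochastic convex bandit.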
Source URL: