Nonstationary Bandits with Habituation and Recovery Dynamics
Document type:
Article
Authors:
Mintz, Yonatan; Aswani, Anil; Kaminsky, Philip; Flowers, Elena; Fukuoka, Yoshimi
Affiliations:
University System of Georgia; Georgia Institute of Technology; University of California System; University of California Berkeley; University of California San Francisco
Journal:
OPERATIONS RESEARCH
ISSN/ISBN:
0030-364X
DOI:
10.1287/opre.2019.1918
Publication date:
2020
Pages:
1493-1516
Keywords:
multiarmed bandit
physical-activity
weight-loss
allocation
intervention
algorithm
demand
cancer
optimization
policies
Abstract:
Many settings involve sequential decision making in which a set of actions can be chosen at each time step, each action provides a stochastic reward, and the distribution of the reward provided by each action is initially unknown. However, frequent selection of a specific action may reduce the expected reward for that action, whereas abstaining from choosing an action may cause its expected reward to increase. Such nonstationary phenomena are observed in many real-world settings, such as personalized healthcare adherence-improving interventions and targeted online advertising. Although finding an optimal policy for general models with nonstationarity is PSPACE-complete, we propose and analyze a new class of models, called reducing or gaining unknown efficacy (ROGUE) bandits, which we show in this paper can capture these phenomena and are amenable to the design of policies with provable properties. We first present a consistent maximum likelihood approach to estimate the parameters of these models and conduct a statistical analysis to construct finite sample concentration bounds. Using this analysis, we develop and analyze two different algorithms for optimizing ROGUE models: an upper confidence bound algorithm (ROGUE-UCB) and an ε-greedy algorithm (ε-ROGUE). Our theoretical analysis shows that, under proper conditions, the ROGUE-UCB and ε-ROGUE algorithms achieve logarithmic-in-time regret, unlike existing algorithms, which incur linear regret. We conclude with a numerical experiment using real-world data from a personalized healthcare adherence-improving intervention to increase physical activity. In this intervention, the goal is to optimize the selection of messages (e.g., confidence increasing versus knowledge increasing) to send to each individual each day to increase adherence and physical activity.
Our results show that ROGUE-UCB and ε-ROGUE perform better in terms of aggregated regret and average reward when compared with state-of-the-art algorithms, and in the context of this intervention, the use of ROGUE-UCB increases daily step counts by roughly 1,000 steps a day (about a half-mile more of walking) compared with other algorithms in a simulation experiment.
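As context for the upper confidence bound approach the abstract describes, below is a minimal sketch of the classical (stationary) UCB1 index policy. The paper's ROGUE-UCB builds on this style of index policy but additionally models habituation and recovery dynamics, which this sketch deliberately omits; the arm payoffs and horizon used here are illustrative assumptions, not figures from the paper.

```python
import math

def ucb1(arms, horizon):
    """Classical stationary UCB1 index policy (illustration only; the
    ROGUE-UCB algorithm described above additionally tracks per-arm
    habituation/recovery state, which this sketch does not model).

    arms: list of zero-argument callables returning rewards in [0, 1].
    Returns per-arm pull counts and the cumulative reward.
    """
    k = len(arms)
    counts = [0] * k      # number of times each arm has been pulled
    means = [0.0] * k     # running empirical mean reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1     # initialization: pull each arm once
        else:
            # pull the arm with the largest upper confidence index
            i = max(range(k),
                    key=lambda a: means[a]
                    + math.sqrt(2 * math.log(t) / counts[a]))
        r = arms[i]()
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # incremental mean update
        total += r
    return counts, total
```

With one clearly better arm, the policy concentrates its pulls on that arm while sampling the worse arm only logarithmically often, which is the behavior the logarithmic-regret guarantees formalize in the stationary case.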
Source URL: