Nonstationary Bandits with Habituation and Recovery Dynamics
Document type:
Article
Authors:
Mintz, Yonatan; Aswani, Anil; Kaminsky, Philip; Flowers, Elena; Fukuoka, Yoshimi
Affiliations:
University System of Georgia; Georgia Institute of Technology; University of California System; University of California Berkeley; University of California San Francisco
Journal:
OPERATIONS RESEARCH
ISSN/ISBN:
0030-364X
DOI:
10.1287/opre.2019.1918
Publication date:
2020
Pages:
1493-1516
Keywords:
multiarmed bandit
physical-activity
weight-loss
allocation
intervention
algorithm
demand
cancer
optimization
policies
Abstract:
Many settings involve sequential decision making in which a set of actions can be chosen at each time step, each action provides a stochastic reward, and the distribution of the reward provided by each action is initially unknown. However, frequent selection of a specific action may reduce the expected reward for that action, whereas abstaining from choosing an action may cause its expected reward to increase. Such nonstationary phenomena are observed in many real-world settings, such as personalized healthcare adherence-improving interventions and targeted online advertising. Although finding an optimal policy for general models with nonstationarity is PSPACE-complete, we propose and analyze a new class of models, called reducing or gaining unknown efficacy (ROGUE) bandits, which we show in this paper can capture these phenomena and are amenable to the design of policies with provable properties. We first present a consistent maximum likelihood approach to estimate the parameters of these models and conduct a statistical analysis to construct finite sample concentration bounds. Using this analysis, we develop and analyze two different algorithms for optimizing ROGUE models: an upper confidence bound algorithm (ROGUE-UCB) and an ε-greedy algorithm (ε-ROGUE). Our theoretical analysis shows that, under proper conditions, the ROGUE-UCB and ε-ROGUE algorithms achieve logarithmic-in-time regret, unlike existing algorithms, which incur linear regret. We conclude with a numerical experiment using real-world data from a personalized healthcare adherence-improving intervention to increase physical activity. In this intervention, the goal is to optimize the selection of messages (e.g., confidence increasing versus knowledge increasing) to send to each individual each day to increase adherence and physical activity.
Our results show that ROGUE-UCB and ε-ROGUE perform better in terms of aggregated regret and average reward when compared with state-of-the-art algorithms, and in the context of this intervention, the use of ROGUE-UCB increases daily step counts by roughly 1,000 steps a day (about a half-mile more of walking) compared with other algorithms in a simulation experiment.
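As context for the upper confidence bound approach the abstract describes, below is a minimal sketch of the classical (stationary) UCB1 index policy. The paper's ROGUE-UCB builds on this style of index policy but additionally models habituation and recovery dynamics, which this sketch deliberately omits; the arm payoffs and horizon used here are illustrative assumptions, not figures from the paper.

```python
import math

def ucb1(arms, horizon):
    """Classical stationary UCB1 index policy (illustration only; the
    ROGUE-UCB algorithm described above additionally tracks per-arm
    habituation/recovery state, which this sketch does not model).

    arms: list of zero-argument callables returning rewards in [0, 1].
    Returns per-arm pull counts and the cumulative reward.
    """
    k = len(arms)
    counts = [0] * k      # number of times each arm has been pulled
    means = [0.0] * k     # running empirical mean reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1     # initialization: pull each arm once
        else:
            # pull the arm with the largest upper confidence index
            i = max(range(k),
                    key=lambda a: means[a]
                    + math.sqrt(2 * math.log(t) / counts[a]))
        r = arms[i]()
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # incremental mean update
        total += r
    return counts, total
```

With one clearly better arm, the policy concentrates its pulls on that arm while sampling the worse arm only logarithmically often, which is the behavior the logarithmic-regret guarantees formalize in the stationary case.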
Source URL: