Adaptive KL-UCB Based Bandit Algorithms for Markovian and IID Settings
Publication Type:
Article
Authors:
Roy, Arghyadip; Shakkottai, Sanjay; Srikant, R.
Affiliations:
Indian Institute of Technology System (IIT System); Indian Institute of Technology (IIT) - Guwahati; University of Texas System; University of Texas Austin; University of Illinois System; University of Illinois Urbana-Champaign
Journal:
IEEE TRANSACTIONS ON AUTOMATIC CONTROL
ISSN/ISBN:
0018-9286
DOI:
10.1109/TAC.2023.3336835
Publication Date:
2024
Pages:
2637-2644
Keywords:
Markov processes
Hidden Markov models
Upper bound
Standards
Indexes
TV
Motion pictures
Kullback-Leibler upper confidence bound (KL-UCB)
multi-armed bandit (MAB)
online learning
regret
rested bandit
Abstract:
In the regret-based formulation of multi-armed bandit (MAB) problems, much of the literature, with rare exceptions, focuses on arms with independent and identically distributed (i.i.d.) rewards. In this article, we consider the problem of obtaining regret guarantees for MAB problems in which the rewards of each arm form a Markov chain that may not belong to a single-parameter exponential family. Achieving logarithmic regret in such problems is not difficult: a variation of the standard Kullback-Leibler upper confidence bound (KL-UCB) algorithm does the job. However, the constants obtained from such an analysis are poor, for the following reason: i.i.d. rewards are a special case of Markov rewards, and it is difficult to design an algorithm that works well regardless of whether the underlying model is truly Markovian or i.i.d. To overcome this issue, we introduce a novel algorithm that identifies whether the rewards from each arm are truly Markovian or i.i.d. using a total-variation-distance-based test. Our algorithm then switches from the standard KL-UCB to a specialized version of KL-UCB when it determines that the arm rewards are Markovian, thus resulting in low regret in both the i.i.d. and Markovian settings.
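As a rough illustration of the two ingredients named in the abstract, the Python sketch below shows a Bernoulli KL-UCB index computed by bisection and a simple total-variation test that compares the empirical next-reward distributions conditioned on the previous reward. The binary reward support, the `threshold` value, and the helper names (`kl_ucb_index`, `looks_markovian`, `choose_arm`) are illustrative assumptions; the paper's actual test, its calibrated threshold, and its specialized Markovian KL-UCB index are more involved than what is shown here.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, c=3.0, tol=1e-6):
    """Standard KL-UCB index for a Bernoulli arm: the largest q >= mean such that
    pulls * KL(mean, q) <= log(t) + c * log(log(t)), found by bisection."""
    if pulls == 0:
        return 1.0
    budget = (np.log(t) + c * np.log(max(np.log(t), 1.0))) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def looks_markovian(rewards, threshold=0.1):
    """Illustrative total-variation test for binary rewards: compare the empirical
    distribution of the next reward conditioned on the previous reward being 0 vs. 1.
    The threshold is a placeholder, not the paper's calibrated test."""
    r = np.asarray(rewards)
    prev, nxt = r[:-1], r[1:]
    if not (prev == 0).any() or not (prev == 1).any():
        return False
    # TV distance between two Bernoulli laws is |p1 - p0|.
    tv = abs(nxt[prev == 1].mean() - nxt[prev == 0].mean())
    return tv > threshold

def choose_arm(arm_rewards, t):
    """Sketch of the switching idea: use the i.i.d. KL-UCB index unless the TV test
    flags an arm as Markovian, in which case the paper's specialized Markovian
    index would be substituted (omitted here)."""
    indices = []
    for rewards in arm_rewards:
        mean = np.mean(rewards) if len(rewards) > 0 else 0.0
        idx = kl_ucb_index(mean, len(rewards), t)
        if looks_markovian(rewards):
            pass  # placeholder: replace idx with the specialized Markovian KL-UCB index
        indices.append(idx)
    return int(np.argmax(indices))
```

In this sketch the switching decision is made per arm from that arm's own reward history, which mirrors the abstract's description of testing each arm individually before choosing which index to apply.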