Nonasymptotic Analysis of Monte Carlo Tree Search
Publication type:
Article
Authors:
Shah, Devavrat; Xie, Qiaomin; Xu, Zhi
Affiliations:
Massachusetts Institute of Technology (MIT); University of Wisconsin System; University of Wisconsin Madison
Journal:
OPERATIONS RESEARCH
ISSN/ISBN:
0030-364X
DOI:
10.1287/opre.2021.2239
Publication date:
2022
Pages:
3234-3260
Keywords:
sampling algorithm
rates
game
go
Abstract:
In this work, we consider a popular tree-based search strategy within the framework of reinforcement learning, Monte Carlo tree search (MCTS), in the context of the infinite-horizon discounted-cost Markov decision process (MDP). Although MCTS is believed to provide an approximate value function for a given state with enough simulations, the claimed proof of this property is incomplete. This is because the variant of MCTS analyzed in prior works, the upper confidence bound for trees (UCT), uses a logarithmic bonus term to balance exploration and exploitation within the tree-based search, following insights from the stochastic multi-armed bandit (MAB) literature. In effect, such an approach assumes that the regret of the underlying recursively dependent nonstationary MABs concentrates around its mean exponentially in the number of steps, which is unlikely to hold, even for stationary MABs. As the key contribution of this work, we establish a polynomial concentration property of regret for a class of nonstationary MABs. This in turn establishes that MCTS with an appropriate polynomial, rather than logarithmic, bonus term in UCB has the claimed property. Interestingly, empirically successful approaches use a similar polynomial form of MCTS, as suggested by our result. Using this as a building block, we argue that MCTS, combined with nearest neighbor supervised learning, acts as a policy improvement operator; that is, it iteratively improves the value function approximation for all states, despite evaluating at only finitely many states, because of the combination with supervised learning. In effect, we establish that to learn an ε-approximation of the value function with respect to the ℓ∞ norm, MCTS combined with nearest neighbor requires a sample size scaling as Õ(ε^{-(d+4)}), where d is the dimension of the state space. This is nearly optimal in view of a minimax lower bound of Ω̃(ε^{-(d+2)}), suggesting the strength of the variant of MCTS we propose here and of our resulting analysis.
Source URL:
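The abstract's central point, replacing UCT's logarithmic exploration bonus with a polynomial one, can be illustrated on a toy stationary bandit. The sketch below uses a bonus of the form β·t^α / n_a; the exponent α and scale β (and the bonus's exact functional form) are illustrative defaults, not the constants from the paper's analysis.

```python
import random

def polynomial_ucb_pull(counts, sums, t, alpha=0.25, beta=1.0):
    """Return the arm maximizing empirical mean + beta * t**alpha / n_a.

    This polynomial bonus is in the spirit of the paper's UCT variant;
    alpha and beta here are illustrative, not the analyzed constants.
    """
    for a, n_a in enumerate(counts):
        if n_a == 0:
            return a  # initialization: pull every arm once
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a] + beta * t**alpha / counts[a],
    )

# Toy 2-armed Bernoulli bandit: arm 1 has the higher mean reward.
random.seed(0)
means = [0.3, 0.7]
counts = [0, 0]
sums = [0.0, 0.0]
for t in range(1, 2001):
    a = polynomial_ucb_pull(counts, sums, t)
    reward = 1.0 if random.random() < means[a] else 0.0
    counts[a] += 1
    sums[a] += reward
```

After 2,000 rounds the better arm dominates the pull counts, while the polynomially decaying bonus keeps forcing occasional visits to the other arm. In the paper's setting, such selection rules are applied recursively at each node of the search tree, where the child-value distributions are nonstationary.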