Nonasymptotic Analysis of Monte Carlo Tree Search
Publication type:
Article
Authors:
Shah, Devavrat; Xie, Qiaomin; Xu, Zhi
Affiliations:
Massachusetts Institute of Technology (MIT); University of Wisconsin System; University of Wisconsin Madison
Journal:
OPERATIONS RESEARCH
ISSN/ISBN:
0030-364X
DOI:
10.1287/opre.2021.2239
Publication date:
2022
Pages:
3234-3260
Keywords:
sampling algorithm
rates
game
go
Abstract:
In this work, we consider a popular tree-based search strategy within the framework of reinforcement learning, Monte Carlo tree search (MCTS), in the context of the infinite-horizon discounted-cost Markov decision process (MDP). Although MCTS is believed to provide an approximate value function for a given state with enough simulations, the claimed proof of this property is incomplete. This is because the variant of MCTS analyzed in prior works, the upper confidence bound for trees (UCT), uses a logarithmic bonus term to balance exploration and exploitation within the tree-based search, following insights from the stochastic multi-armed bandit (MAB) literature. In effect, such an approach assumes that the regret of the underlying recursively dependent nonstationary MABs concentrates around its mean exponentially in the number of steps, which is unlikely to hold, even for stationary MABs. As the key contribution of this work, we establish a polynomial concentration property of regret for a class of nonstationary MABs. This in turn establishes that MCTS with an appropriate polynomial, rather than logarithmic, bonus term in UCB has the claimed property. Interestingly, empirically successful approaches use a similar polynomial form of MCTS, as suggested by our result. Using this as a building block, we argue that MCTS, combined with nearest neighbor supervised learning, acts as a policy improvement operator; that is, it iteratively improves the value function approximation for all states, despite evaluating at only finitely many states, because of the combination with supervised learning. In effect, we establish that to learn an ε-approximation of the value function with respect to the ℓ∞ norm, MCTS combined with nearest neighbor requires a sample size scaling as Õ(ε^{-(d+4)}), where d is the dimension of the state space. This is nearly optimal in view of a minimax lower bound of Ω̃(ε^{-(d+2)}), suggesting the strength of the variant of MCTS we propose here and of our resulting analysis.
Source URL:
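The abstract's central point, replacing UCT's logarithmic exploration bonus with a polynomial one, can be illustrated on a toy stationary bandit. The sketch below uses a bonus of the form β·t^α / n_a; the exponent α and scale β (and the bonus's exact functional form) are illustrative defaults, not the constants from the paper's analysis.

```python
import random

def polynomial_ucb_pull(counts, sums, t, alpha=0.25, beta=1.0):
    """Return the arm maximizing empirical mean + beta * t**alpha / n_a.

    This polynomial bonus is in the spirit of the paper's UCT variant;
    alpha and beta here are illustrative, not the analyzed constants.
    """
    for a, n_a in enumerate(counts):
        if n_a == 0:
            return a  # initialization: pull every arm once
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a] + beta * t**alpha / counts[a],
    )

# Toy 2-armed Bernoulli bandit: arm 1 has the higher mean reward.
random.seed(0)
means = [0.3, 0.7]
counts = [0, 0]
sums = [0.0, 0.0]
for t in range(1, 2001):
    a = polynomial_ucb_pull(counts, sums, t)
    reward = 1.0 if random.random() < means[a] else 0.0
    counts[a] += 1
    sums[a] += reward
```

After 2,000 rounds the better arm dominates the pull counts, while the polynomially decaying bonus keeps forcing occasional visits to the other arm. In the paper's setting, such selection rules are applied recursively at each node of the search tree, where the child-value distributions are nonstationary.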