-
作者:Shah, Devavrat; Xie, Qiaomin; Xu, Zhi
作者单位:Massachusetts Institute of Technology (MIT); University of Wisconsin System; University of Wisconsin Madison
摘要:In this work, we consider the popular tree-based search strategy within the framework of reinforcement learning, the Monte Carlo tree search (MCTS), in the context of the infinite-horizon discounted cost Markov decision process (MDP). Although MCTS is believed to provide an approximate value function for a given state with enough simulations, the claimed proof of this property is incomplete. This is because the variant of MCTS, the upper confidence bound for trees (UCT), analyzed in prior work...
-
作者:Kallus, Nathan; Uehara, Masatoshi
作者单位:Cornell University
摘要:Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds and efficient influence functions for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible ...
-
作者:Qu, Guannan; Wierman, Adam; Li, Na
作者单位:Carnegie Mellon University; California Institute of Technology; Harvard University
摘要:We study reinforcement learning (RL) in a setting with a network of agents whose states and actions interact in a local manner where the objective is to find localized policies such that the (discounted) global reward is maximized. A fundamental challenge in this setting is that the state-action space size scales exponentially in the number of agents, rendering the problem intractable for large networks. In this paper, we propose a scalable actor critic (SAC) framework that exploits the networ...
-
作者:Hu, Yichun; Kallus, Nathan; Mao, Xiaojie
作者单位:Cornell University; Tsinghua University
摘要:We study a nonparametric contextual bandit problem in which the expected reward functions belong to a Holder class with smoothness parameter beta. We showhowthis interpolates between two extremes that were previously studied in isolation: nondifferentiable bandits (beta at most 1), with which rate-optimal regret is achieved by running separate noncontextual bandits in different context regions, and parametric-response bandits (infinite beta), with which rate-optimal regret can be achieved with...
-
作者:Keppo, Jussi; Kim, Michael Jong; Zhang, Xinyuan
作者单位:National University of Singapore; National University of Singapore; University of British Columbia
摘要:We study optimal manipulation of a Bayesian learner through adaptive provisioning of information. The problem is motivated by settings in which a firm can disseminate possibly biased information at a cost, to influence the public's belief about a hidden parameter related to the firm's payoffs. For example, firms advertise to sell products. We study a sequential optimizationmodel in which the firmdynamically decides on the quantity and content of information sent to the public, aiming to maximi...
-
作者:Eckman, David J.; Plumlee, Matthew; Nelson, Barry L.
作者单位:Texas A&M University System; Texas A&M University College Station; Northwestern University
摘要:When working with models that allow for many candidate solutions, simulation practitioners can benefit fromscreening out unacceptable solutions in a statistically controlled way. However, for large solution spaces, estimating the performance of all solutions through simulation can prove impractical. We propose a statistical framework for screening solutions even when only a relatively small subset of them is simulated. Our framework derives its superiority over exhaustive screening approaches ...
-
作者:Jiang, Nan; Xie, Weijun
作者单位:Virginia Polytechnic Institute & State University
摘要:In a chance constrained program (CCP), decision makers seek the best decision whose probability of violating the uncertainty constraints is within the prespecified risk level. As a CCP is often nonconvex and is difficult to solve to optimality, much effort has been devoted to developing convex inner approximations for a CCP, among which the conditional value-at-risk (CVaR) has been known to be the best for more than a decade. This paper studies and generalizes the ALSO-X, originally proposed b...
-
作者:Wang, Yining; Wang, He
作者单位:State University System of Florida; University of Florida; University System of Georgia; Georgia Institute of Technology
摘要:Price-based revenue management is an important problem in operations management with many practical applications. The problemconsiders a sellerwho sells one ormultiple products over T consecutive periods and is subject to constraints on the initial inventory levels of resources. Whereas, in theory, the optimal pricing policy could be obtained via dynamic programming, computing the exact dynamic programming solution is often intractable. Approximate policies, such as the resolving heuristics, a...
-
作者:Gur, Yonatan; Momeni, Ahmadreza; Wager, Stefan
作者单位:Stanford University; Stanford University
摘要:We study a nonparametric multiarmed bandit problem with stochastic covariates, where a key complexity driver is the smoothness of payoff functions with respect to covariates. Previous studies have focused on deriving minimax-optimal algorithms in cases where it is a priori known how smooth the payoff functions are. In practice, however, the smoothness of payoff functions is typically not known in advance, and misspecification of smoothness may severely deteriorate the performance of existing m...
-
作者:Blanchet, Jose H.; Reiman, Martin I.; Shah, Virag; Wein, Lawrence M.; Wu, Linjia
作者单位:Stanford University; Columbia University; Stanford University
摘要:We consider a matching market where buyers and sellers arrive according to independent Poisson processes at the same rate and independently abandon the market if not matched after an exponential amount of time with the same mean. In this centralized market, the utility for the system manager from matching any buyer and any seller is a general random variable. We consider a sequence of systems indexed by n where the arrivals in the nth system are sped up by a factor of n. We analyze two familie...