Offline Reinforcement Learning for Human-Guided Human-Machine Interaction with Private Information

Publication Type:
Article; Early Access
Authors:
Fu, Zuyue; Qi, Zhengling; Yang, Zhuoran; Wang, Zhaoran; Wang, Lan
Affiliations:
Northwestern University; George Washington University; Yale University; University of Miami
Journal:
MANAGEMENT SCIENCE
ISSN/ISBN:
0025-1909
DOI:
10.1287/mnsc.2022.04112
Publication Year:
2025
Keywords:
two-player turn-based game with private information; instrumental variables; offline reinforcement learning
Abstract:
Motivated by human-machine interactions such as recommending videos to improve customer engagement, we study human-guided human-machine interaction for decision making with private information. We model this interaction as a two-player turn-based game, where one player (Bob, a human) guides the other player (Alice, a machine) toward a common goal. Specifically, we focus on offline reinforcement learning (RL) in this game, where the goal is to find a policy pair for Alice and Bob that maximizes their expected total rewards based on an offline data set collected a priori. The offline setting presents two challenges: (i) we cannot collect Bob's private information, which leads to confounding bias when standard RL methods are applied, and (ii) there is a distributional mismatch between the behavior policy used to collect the data and the optimal policy we aim to learn. To tackle the confounding bias, we treat Bob's previous action as an instrumental variable for Alice's current decision making in order to adjust for the unmeasured confounding. We establish a novel identification result and propose a new off-policy evaluation (OPE) method for evaluating policy pairs in this two-player turn-based game. To tackle the distributional mismatch, we leverage the idea of pessimism and use our OPE method to develop an off-policy learning algorithm that finds a desirable policy pair for both Alice and Bob. Moreover, we prove that under certain technical assumptions, the policy pair obtained through our method converges to the optimal one at a satisfactory rate. Finally, we conduct a simulation study to demonstrate the performance of the proposed method.
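The pessimism principle mentioned in the abstract can be illustrated with a minimal sketch: instead of selecting the policy pair with the highest point estimate from off-policy evaluation, one selects the pair maximizing a lower confidence bound, penalizing candidates whose values are poorly estimated from the offline data. This is not the paper's algorithm; all names and the confidence-bound form are hypothetical, for illustration only.

```python
import numpy as np

def pessimistic_select(values, widths):
    """Pick the candidate policy pair maximizing the lower confidence bound.

    values -- OPE point estimates of each candidate's expected total reward
    widths -- uncertainty half-widths of those estimates (larger when the
              offline data covers the candidate's state-action pairs poorly)

    Maximizing (estimate - width) guards against distributional mismatch:
    a candidate can win only if its value is high *and* well estimated.
    """
    lcb = np.asarray(values, dtype=float) - np.asarray(widths, dtype=float)
    return int(np.argmax(lcb))

# A candidate with a higher point estimate (1.0) but large uncertainty (0.5)
# loses to a slightly lower (0.9) but well-estimated (0.05) candidate.
best = pessimistic_select(values=[1.0, 0.9], widths=[0.5, 0.05])
# -> 1
```

The greedy (non-pessimistic) choice would pick index 0 here; pessimism flips the decision because the offline data supports the second candidate more reliably.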
Source URL: