Does AI help humans make better decisions? A statistical evaluation framework for experimental and observational studies

Type:
Article
Authors:
Ben-Michael, Eli; Greiner, D. James; Huang, Melody; Imai, Kosuke; Jiang, Zhichao; Shin, Sooahn
Affiliations:
Carnegie Mellon University; Carnegie Mellon University; Harvard University; Yale University; Yale University; Harvard University; Harvard University; Sun Yat Sen University
Journal:
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
ISSN/ISBN:
0027-8424
DOI:
10.1073/pnas.2505106122
Publication date:
2025-09-23
Keywords:
Abstract:
The use of AI, or more generally data-driven algorithms, has become ubiquitous in today's society. Yet in many cases, especially when stakes are high, humans still make the final decisions. The critical question, therefore, is whether AI helps humans make better decisions compared to a human-alone or AI-alone system. We introduce a methodological framework to answer this question empirically with minimal assumptions. We measure a decision maker's ability to make correct decisions using standard classification metrics based on the baseline potential outcome. We consider a single-blinded and unconfounded treatment assignment, in which the provision of AI-generated recommendations is assumed to be randomized across cases, conditional on observed covariates, with final decisions made by humans. Under this study design, we show how to compare the performance of three alternative decision-making systems: human-alone, human-with-AI, and AI-alone. Importantly, the AI-alone system encompasses any individualized treatment assignment, including those not used in the original study. We also show when AI recommendations should be provided to a human decision maker, and when one should follow such recommendations. We apply the proposed methodology to our own randomized controlled trial evaluating a pretrial risk assessment instrument. We find that the risk assessment recommendations do not improve the classification accuracy of a judge's decision to impose cash bail. Furthermore, replacing a human judge with algorithms, in particular the risk assessment score and a large language model, yields worse classification performance.
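
Illustrative sketch (not the authors' estimator): the abstract compares human-alone and human-with-AI decisions using classification metrics defined on the baseline potential outcome, which is observed only for cases that receive the negative decision. The Python sketch below shows one way such a comparison could be computed for the 0-1 misclassification rate. It assumes, as simplifications not stated in the abstract, that AI provision is completely randomized (no covariate adjustment) and that the observed outcome equals the baseline potential outcome whenever the decision is negative. The function name and the variables z, d, and y are hypothetical.

```python
import numpy as np


def misclassification_diff(z, d, y):
    """Estimate the misclassification rate of human-with-AI minus human-alone.

    z: 1 if the AI recommendation was provided for the case, else 0 (assumed randomized)
    d: 1 if the human made the positive decision (e.g. imposed cash bail), else 0
    y: observed binary outcome; assumed to equal the baseline potential
       outcome Y(0) whenever d == 0, and not used when d == 1
    """
    z, d, y = (np.asarray(a) for a in (z, d, y))

    def cell(arm, outcome):
        # Share of cases in the given arm with a negative decision and this outcome.
        in_arm = z == arm
        return np.mean((d[in_arm] == 0) & (y[in_arm] == outcome))

    # False negatives, Pr(Y(0)=1, D=0), are observed directly in each arm.
    fn_diff = cell(1, 1) - cell(0, 1)
    # False positives, Pr(Y(0)=0, D=1) = Pr(Y(0)=0) - Pr(Y(0)=0, D=0); under
    # randomization Pr(Y(0)=0) is the same in both arms and cancels in the difference.
    fp_diff = cell(0, 0) - cell(1, 0)
    return fn_diff + fp_diff
```

A negative value would indicate fewer misclassifications when recommendations are provided. The sketch returns a point estimate only; uncertainty quantification and the covariate-conditional design described in the abstract are outside its scope.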