Quality Control for Crowd Workers and for Language Models: A Framework for Free-Text Response Evaluation with No Ground Truth

Publication type:
Article; Early Access
Authors:
Yahav, Inbal; Goldstein, Anat; Geva, Tomer; Meir, Shahar; Shehory, Onn
Affiliations:
Tel Aviv University; Ariel University; Bar Ilan University
Journal:
INFORMATION SYSTEMS RESEARCH
ISSN/ISBN:
1047-7047
DOI:
10.1287/isre.2023.0426
Publication date:
2025
Keywords:
Abstract:
In recent years, the field of natural language processing has made remarkable progress with the emergence of large language models (LLMs). In particular, the ability of LLMs to provide fact-based, free-text responses to user queries has the potential to revolutionize domains such as online search and the use of informative chatbots. However, extensive validation is required before the response accuracy of question-answering LLMs can be confidently trusted. This paper introduces a framework to address this challenge: automated quality evaluation based on textual responses (AQER). The AQER framework focuses on two primary tasks: evaluating the quality of individual workers from their free-text responses when no ground-truth data are available, and assessing the quality of LLM responses given a set of worker-generated responses. AQER is intuitive, easy to implement, and flexible enough to accommodate different components. To evaluate AQER's effectiveness, we conducted empirical evaluations using semi-synthetic and real-world question-and-answer data sets, as well as stress testing through numerical simulations. We also provide analytical motivation and show method convergence and boundary conditions using the probably approximately correct (PAC) learning framework. The results demonstrate AQER's robustness in evaluating LLMs and workers, and its superiority over baseline approaches. These findings establish AQER as a benchmark for future research in this field.
Source URL:
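
Illustrative note: the abstract does not spell out AQER's mechanics, so the Python sketch below is purely illustrative and should not be read as the authors' method. It shows one common way to frame the two tasks the abstract names: scoring each worker by agreement with peers on shared questions (requiring no ground truth), then scoring an LLM by its quality-weighted agreement with those workers. All names (text_similarity, score_workers, score_model) and the lexical similarity measure are hypothetical choices for this sketch; a real system would likely substitute semantic embeddings for the crude string matcher used here.

# Hypothetical sketch of ground-truth-free quality scoring, NOT the AQER
# algorithm from the paper. Standard library only, so it runs as-is.
from difflib import SequenceMatcher
from itertools import combinations


def text_similarity(a: str, b: str) -> float:
    # Crude lexical similarity in [0, 1]; a real system would use
    # semantic embeddings instead of character-level matching.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_workers(responses: dict[str, dict[str, str]]) -> dict[str, float]:
    # Score each worker by average agreement with peers on shared questions.
    # `responses` maps worker id -> {question id -> free-text answer}.
    totals = {w: [0.0, 0] for w in responses}  # worker -> [similarity sum, count]
    for w1, w2 in combinations(responses, 2):
        for q in responses[w1].keys() & responses[w2].keys():
            sim = text_similarity(responses[w1][q], responses[w2][q])
            for w in (w1, w2):
                totals[w][0] += sim
                totals[w][1] += 1
    return {w: (s / n if n else 0.0) for w, (s, n) in totals.items()}


def score_model(model_answers: dict[str, str],
                responses: dict[str, dict[str, str]],
                worker_scores: dict[str, float]) -> float:
    # Score an LLM by its agreement with workers, weighting each worker's
    # vote by that worker's estimated quality.
    num = den = 0.0
    for w, answers in responses.items():
        for q, ans in answers.items():
            if q in model_answers:
                num += worker_scores[w] * text_similarity(model_answers[q], ans)
                den += worker_scores[w]
    return num / den if den else 0.0


if __name__ == "__main__":
    workers = {
        "w1": {"q1": "Paris is the capital of France.", "q2": "Water boils at 100C."},
        "w2": {"q1": "The capital of France is Paris.", "q2": "Water boils at 100 degrees C."},
        "w3": {"q1": "London.", "q2": "It boils at 50C."},  # low-quality worker
    }
    llm = {"q1": "Paris is the capital of France.", "q2": "Water boils at 100C at sea level."}
    scores = score_workers(workers)
    print("worker scores:", scores)
    print("LLM score:", round(score_model(llm, workers, scores), 3))

The design choice worth noting is that peer agreement serves as a proxy for quality when no ground truth exists: workers who systematically agree with others are assumed more reliable, and their answers then anchor the LLM's evaluation, mirroring the two-task structure the abstract describes.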