Explicitly unbiased large language models still form biased associations

Publication type:
Article
Authors:
Bai, Xuechunzi; Wang, Angelina; Sucholutsky, Ilia; Griffiths, Thomas L.
Affiliations:
University of Chicago; Stanford University; New York University; Princeton University; Princeton University
Journal:
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
ISSN/ISBN:
0027-8424
DOI:
10.1073/pnas.2416228122
Publication date:
2025-02-18
Keywords:
implicit social cognition; gender stereotypes; prejudice; ability; race
Abstract:
Large language models (LLMs) can pass explicit social bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: As LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. We address both challenges by introducing two measures: LLM Word Association Test, a prompt-based method for revealing implicit bias; and LLM Relative Decision Test, a strategy to detect subtle discrimination in contextual decisions. Both measures are based on psychological research: LLM Word Association Test adapts the Implicit Association Test, widely used to study the automatic associations between concepts held in human minds; and LLM Relative Decision Test operationalizes psychological results indicating that relative evaluations between two candidates, not absolute evaluations assessing each independently, are more diagnostic of implicit biases. Using these measures, we found pervasive stereotype biases mirroring those in society in 8 value-aligned models across 4 social categories (race, gender, religion, health) in 21 stereotypes (such as race and criminality, race and weapons, gender and science, age and negativity). These prompt-based measures draw from psychology's long history of research into measuring stereotypes based on purely observable behavior; they expose nuanced biases in proprietary value-aligned LLMs that appear unbiased according to standard benchmarks.