Evaluating large language models in theory of mind tasks
Publication type:
Article
Author(s):
Kosinski, Michal
Affiliation(s):
Stanford University
Journal:
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
ISSN/ISBN:
0027-8424
DOI:
10.1073/pnas.2405460121
Publication date:
2024-11-05
Keywords:
false-belief
neural-networks
meta-analysis
PERSPECTIVE
children's
emergence
infants
others
sense
Abstract:
Eleven large language models (LLMs) were assessed using 40 bespoke false-belief tasks, considered a gold standard in testing theory of mind (ToM) in humans. Each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to solve a single task. Older models solved no tasks; Generative Pre-trained Transformer (GPT)-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of 6-y-old children observed in past studies. We explore the potential interpretation of these results, including the intriguing possibility that ToM-like ability, previously considered unique to humans, may have emerged as an unintended by-product of LLMs' improving language skills. Regardless of how we interpret these outcomes, they signify the advent of more powerful and socially skilled AI, with profound positive and negative implications.
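The abstract's all-or-nothing scoring rule (a task counts as solved only if all eight of its scenario variants are solved) can be illustrated with a short sketch. This is not the paper's code; the function names and the example data are hypothetical, assuming only the task structure stated in the abstract (40 tasks, 8 scenarios each).

```python
from typing import List

def task_solved(scenario_results: List[bool]) -> bool:
    """A task is solved only if all 8 of its scenario variants are solved."""
    assert len(scenario_results) == 8
    return all(scenario_results)

def fraction_of_tasks_solved(results_per_task: List[List[bool]]) -> float:
    """Fraction of the 40 bespoke tasks solved under the all-or-nothing rule."""
    solved = sum(task_solved(r) for r in results_per_task)
    return solved / len(results_per_task)

# Hypothetical example: a model that fully solves 30 of 40 tasks scores 0.75,
# matching the 75% reported for ChatGPT-4 in the abstract.
example = [[True] * 8] * 30 + [[True] * 7 + [False]] * 10
print(fraction_of_tasks_solved(example))  # 0.75
```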
Source URL: