Robust detection of watermarks for large language models under human edits
Document type:
Article; Early Access
Authors:
Li, Xiang; Ruan, Feng; Wang, Huiyuan; Long, Qi; Su, Weijie J.
Affiliations:
University of Pennsylvania; Northwestern University; University of Pennsylvania
Journal:
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
ISSN/ISBN:
1369-7412
DOI:
10.1093/jrsssb/qkaf056
Publication year:
2025
Keywords:
higher criticism
statistics
tests
Abstract:
Watermarking is an effective approach to distinguishing text generated by large language models (LLMs) from human-written text. However, the pervasive presence of human edits on LLM-generated text dilutes watermark signals, thereby significantly degrading the detection performance of existing methods. In this paper, by modelling human edits through mixture model detection, we introduce a new method, a truncated goodness-of-fit test (Tr-GoF), for detecting watermarked text under human edits. We prove that Tr-GoF achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals. Importantly, Tr-GoF achieves this optimality adaptively, without requiring precise knowledge of human edit levels or probabilistic specifications of the LLMs, unlike the optimal but impractical Neyman-Pearson likelihood ratio test. Moreover, we establish that Tr-GoF attains the highest detection efficiency rate under moderate text modifications. In contrast, sum-based detection rules used by existing methods fail to achieve optimal robustness in both regimes, because the additive nature of their statistics is less resilient to edit-induced noise. We demonstrate Tr-GoF's competitive and sometimes superior performance on synthetic data and on open-source LLMs in the OPT and LLaMA families.
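To make the abstract's contrast concrete: the paper keys its test to the higher-criticism family of goodness-of-fit statistics (see the keywords), which scan the sorted per-token p-values rather than summing them. The sketch below is a minimal, hypothetical illustration of this idea, not the authors' Tr-GoF implementation: it compares a plain higher-criticism statistic on uniform p-values (unwatermarked/heavily edited text) against the same statistic on a sparse mixture in which only a few tokens retain a strong watermark signal, the regime where sum-based rules lose power. The function name, truncation level, and simulated data are illustrative choices.

```python
import numpy as np

def higher_criticism(pvals, alpha0=0.5):
    """Truncated higher-criticism statistic over sorted p-values.

    Illustrative sketch only: Tr-GoF in the paper is a truncated
    goodness-of-fit family of which higher criticism is one member.
    """
    p = np.sort(np.asarray(pvals, dtype=float))
    n = p.size
    k = np.arange(1, n + 1)
    # standardized deviation of the empirical CDF from Uniform(0, 1)
    hc = np.sqrt(n) * (k / n - p) / np.sqrt(p * (1.0 - p) + 1e-12)
    mask = p <= alpha0  # truncate: only small p-values carry signal
    return float(hc[mask].max()) if mask.any() else 0.0

rng = np.random.default_rng(0)
# null: all 500 token p-values uniform (no surviving watermark signal)
null_stat = higher_criticism(rng.uniform(size=500))
# sparse mixture: 20 strongly watermarked tokens among 480 edited ones
mixed = np.concatenate([rng.uniform(size=480), rng.uniform(size=20) * 1e-4])
alt_stat = higher_criticism(mixed)
```

A sum-based statistic averages the 20 extreme p-values against 480 noise terms, while the scan statistic peaks exactly at the sorted rank where the watermarked tokens cluster, which is why `alt_stat` far exceeds `null_stat` here.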
Source URL: