Word-Level Maximum Mean Discrepancy Regularization for Word Embedding

Publication Type:
Article; Early Access
Authors:
Gao, Youqian; Dai, Ben
Affiliation:
Chinese University of Hong Kong
Journal:
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
ISSN/ISBN:
0162-1459
DOI:
10.1080/01621459.2025.2547978
Publication Date:
2025
Keywords:
Classification; Regression
Abstract:
Word embedding is widely used in natural language processing (NLP) to represent the words of a textual dataset as numerical vectors. However, estimating word embeddings can suffer from severe overfitting because of the huge variety of words. To address this issue, this article proposes a novel regularization framework that recognizes and accounts for word-level distribution discrepancy, a common phenomenon in a range of NLP tasks where word distributions are noticeably disparate under different labels. The proposed regularization, referred to as word-level MMD (wMMD), is a variant of maximum mean discrepancy (MMD) that serves a specific purpose: to enhance and preserve the distribution discrepancies within word embedding vectors and thus prevent overfitting. Our theoretical analysis shows that wMMD can effectively operate as a dimension-reduction technique for word embedding, thereby significantly improving the robustness and generalization of NLP models. The numerical effectiveness of wMMD and its variants is demonstrated on various simulated examples and on the CE-T1 and BBC News datasets with state-of-the-art NLP deep learning architectures. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
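For context on the quantity the proposed wMMD builds on, below is a minimal sketch of the standard biased empirical estimate of squared maximum mean discrepancy (MMD) with a Gaussian kernel. This illustrates only the generic MMD between two samples; it is not the authors' word-level variant, and the kernel choice, bandwidth `sigma`, and the synthetic label-conditional samples are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix between rows of x and rows of y.
    d2 = (np.sum(x**2, axis=1)[:, None]
          + np.sum(y**2, axis=1)[None, :]
          - 2.0 * x @ y.T)
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd2(x, y, sigma=1.0):
    # Biased empirical estimate of squared MMD between samples x and y:
    # mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    kxx = gaussian_kernel(x, x, sigma).mean()
    kyy = gaussian_kernel(y, y, sigma).mean()
    kxy = gaussian_kernel(x, y, sigma).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
# Hypothetical embedding vectors of words appearing under two labels;
# b is mean-shifted relative to a, c has the same distribution as a.
a = rng.normal(0.0, 1.0, size=(200, 5))
b = rng.normal(1.0, 1.0, size=(200, 5))
c = rng.normal(0.0, 1.0, size=(200, 5))

# Disparate label-conditional distributions yield a larger MMD estimate.
print(mmd2(a, b) > mmd2(a, c))
```

A regularizer in the spirit of the abstract would reward (rather than penalize) such a discrepancy computed at the word level, so that the learned embedding preserves the separation between label-conditional word distributions.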
Source URL: