Word-Level Maximum Mean Discrepancy Regularization for Word Embedding

Publication Type:
Article; Early Access
Authors:
Gao, Youqian; Dai, Ben
Affiliation:
Chinese University of Hong Kong
Journal:
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
ISSN/ISBN:
0162-1459
DOI:
10.1080/01621459.2025.2547978
Publication Date:
2025
Keywords:
Classification; Regression
Abstract:
Word embedding is widely used in natural language processing (NLP) to represent the words of a textual dataset as numerical vectors. However, estimating word embeddings can suffer from severe overfitting because of the huge variety of words. To address this issue, this article proposes a novel regularization framework that recognizes and accounts for word-level distribution discrepancy, a common phenomenon in a range of NLP tasks where word distributions are noticeably disparate under different labels. The proposed regularization, referred to as word-level MMD (wMMD), is a variant of maximum mean discrepancy (MMD) that serves a specific purpose: to enhance and preserve the distribution discrepancies within word embedding vectors and thus prevent overfitting. Our theoretical analysis shows that wMMD can effectively operate as a dimension-reduction technique for word embedding, thereby significantly improving the robustness and generalization of NLP models. The numerical effectiveness of wMMD and its variants is demonstrated on various simulated examples and on the CE-T1 and BBC News datasets with state-of-the-art NLP deep learning architectures. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
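For context on the quantity the proposed wMMD builds on, below is a minimal sketch of the standard biased empirical estimate of squared maximum mean discrepancy (MMD) with a Gaussian kernel. This illustrates only the generic MMD between two samples; it is not the authors' word-level variant, and the kernel choice, bandwidth `sigma`, and the synthetic label-conditional samples are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix between rows of x and rows of y.
    d2 = (np.sum(x**2, axis=1)[:, None]
          + np.sum(y**2, axis=1)[None, :]
          - 2.0 * x @ y.T)
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd2(x, y, sigma=1.0):
    # Biased empirical estimate of squared MMD between samples x and y:
    # mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    kxx = gaussian_kernel(x, x, sigma).mean()
    kyy = gaussian_kernel(y, y, sigma).mean()
    kxy = gaussian_kernel(x, y, sigma).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
# Hypothetical embedding vectors of words appearing under two labels;
# b is mean-shifted relative to a, c has the same distribution as a.
a = rng.normal(0.0, 1.0, size=(200, 5))
b = rng.normal(1.0, 1.0, size=(200, 5))
c = rng.normal(0.0, 1.0, size=(200, 5))

# Disparate label-conditional distributions yield a larger MMD estimate.
print(mmd2(a, b) > mmd2(a, c))
```

A regularizer in the spirit of the abstract would reward (rather than penalize) such a discrepancy computed at the word level, so that the learned embedding preserves the separation between label-conditional word distributions.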
Source URL: