A DUAL-DICTIONARY MODEL FOR MINING DOMAIN-SPECIFIC CHINESE TEXTS

Type:
Article
Authors:
Xu, Jiaze; Pan, Changzai; Deng, Ke
Affiliations:
Tsinghua University; Tsinghua University; Tsinghua University
Journal:
ANNALS OF APPLIED STATISTICS
ISSN/ISBN:
1932-6157
DOI:
10.1214/25-AOAS2035
Publication date:
2025
Pages:
1147-1166
Keywords:
maximum-likelihood
Abstract:
Processing domain-specific Chinese texts is one of the most challenging tasks in natural language processing (NLP), owing both to the unique characteristics of the Chinese language and to the difficulties particular to domain-specific texts. Existing NLP methods, based on supervised learning or large language models, often suffer from unstable performance on such texts. In this study, we propose a novel statistical approach called TopWORDS-MEPA (abbreviated as TWM) that simultaneously achieves high-quality meta-pattern discovery, named entity recognition, text segmentation, and relation extraction from unannotated target texts with little training information. Simulation studies and real data applications confirm that TopWORDS-MEPA is a powerful alternative to existing NLP methods for processing domain-specific Chinese texts, offering competitive performance, transparent interpretation, low training and computing costs, and efficient utilization of domain knowledge.
Source URL: