READABILITY PREDICTION: HOW MANY FEATURES ARE NECESSARY?

Document type:

Article

Authors:

Schwendinger, Florian; Vana, Laura; Hornik, Kurt

Affiliations:

University of Klagenfurt; Technische Universität Wien; Vienna University of Economics & Business

Journal:

ANNALS OF APPLIED STATISTICS

ISSN/ISBN:

1932-6157

DOI:

10.1214/23-AOAS1820

Publication date:

2024

Pages:

1010-1034

Keywords:

variable selection; regression shrinkage; models; regularization

Abstract:

Traditionally, readability prediction has relied on readability formulas, which are based on shallow text characteristics such as average word and sentence length. With recent advances in text mining and natural language processing, more complex text properties can be incorporated into readability prediction models, with papers in the literature suggesting the use of up to 200 features for predicting text readability. However, many of the features generated using natural language processing tools are highly correlated and can be thought to measure similar latent text properties. When dealing with a high-dimensional space of correlated features, removing the redundant variables has two advantages: (1) improving interpretability and (2) increasing the predictive power of the model. In this paper we propose an ordinal version of the averaged lasso, which combines hierarchical clustering with the lasso, in order to identify relevant features for readability prediction. We illustrate the approach on two corpora and show improved prediction accuracy when benchmarking against a set of competing models. The annotated corpora as well as the steps necessary for feature creation are freely available as R packages, thus allowing the obtained results to be directly incorporated into a readability estimation pipeline.
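
Illustrative sketch (not from the paper): the abstract describes combining hierarchical clustering with the lasso so that groups of highly correlated features are handled together. The Python below is one rough way to picture that idea and makes several assumptions that the record does not confirm: single-linkage clustering on the distance 1 - correlation, a cut threshold of 0.5, within-cluster averaging of standardized features, a squared-error lasso fitted by coordinate descent in place of the ordinal model the authors propose, and simulated data instead of the readability corpora. All function and variable names are hypothetical.

```python
import random
from statistics import mean, pstdev

def standardize(col):
    """Center a column and scale it to unit (population) variance."""
    m, s = mean(col), pstdev(col)
    return [(v - m) / s for v in col]

def corr(a, b):
    """Correlation of two standardized columns (mean of their products)."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

def cluster_features(cols, threshold):
    """Single-linkage agglomerative clustering with distance 1 - corr.
    Merging stops once the closest pair is at least `threshold` apart."""
    clusters = [[j] for j in range(len(cols))]

    def dist(c1, c2):
        return min(1 - corr(cols[i], cols[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        d, a, b = min(
            (dist(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d >= threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def average_clusters(cols, clusters):
    """Replace each cluster by the re-standardized mean of its members."""
    n = len(cols[0])
    return [standardize([mean(cols[j][i] for j in c) for i in range(n)])
            for c in clusters]

def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(cols, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.
    Assumes standardized columns and a centered response."""
    n = len(y)
    beta = [0.0] * len(cols)
    resid = list(y)  # residual y - Xb, kept up to date after each update
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            # Covariance of column j with the partial residual (column j added back).
            rho = sum(c * (r + beta[j] * c) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam)
            if new != beta[j]:
                delta = new - beta[j]
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return beta

# Simulated data (assumption): two latent properties, each measured by
# several noisy features, plus one pure-noise feature. Only z1 drives y.
random.seed(0)
n = 200
z1 = [random.gauss(0, 1) for _ in range(n)]
z2 = [random.gauss(0, 1) for _ in range(n)]
raw = (
    [[z + random.gauss(0, 0.3) for z in z1] for _ in range(3)]
    + [[z + random.gauss(0, 0.3) for z in z2] for _ in range(2)]
    + [[random.gauss(0, 1) for _ in range(n)]]
)
y = [2 * z + random.gauss(0, 0.5) for z in z1]
y_mean = mean(y)
y = [v - y_mean for v in y]

cols = [standardize(c) for c in raw]
clusters = cluster_features(cols, threshold=0.5)
averaged = average_clusters(cols, clusters)
beta = lasso_cd(averaged, y, lam=0.5)
print("clusters:", clusters)
print("coefficients per cluster:", beta)
```

In this toy setup the six raw features collapse into three averaged predictors, and the penalty keeps only the one tied to the response. An ordinal readability scale would call for an ordinal likelihood in place of the squared-error loss used here.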
