HANDLING CATEGORICAL FEATURES WITH MANY LEVELS USING A PRODUCT PARTITION MODEL

Publication type:
Article
Authors:
Criscuolo, Tulio L.; Assuncao, Renato M.; Loschi, Rosangela H.; Meira Jr, Wagner; Cruz-Reyes, Danna
Affiliations:
Universidade Federal de Minas Gerais; Environmental Systems Research Institute, Inc. (ESRI); Universidade Federal de Minas Gerais; National University of Rosario
Journal:
ANNALS OF APPLIED STATISTICS
ISSN/ISBN:
1932-6157
DOI:
10.1214/22-AOAS1651
Publication date:
2023
Pages:
786-814
Keywords:
selection predictors
Abstract:
A common difficulty in data analysis is how to handle categorical predictors with a large number of levels or categories. Few proposals have been developed to tackle this important and frequent problem. We introduce a generative model that simultaneously carries out the model fitting and the aggregation of the categorical levels into larger groups. We represent the categorical predictor by a graph whose nodes are the categories and establish a probability distribution over meaningful partitions of this graph. Conditional on the observed data, we obtain a posterior distribution for the aggregation of levels, allowing inference about the most probable clustering of the categories. Simultaneously, we obtain inference about all the other regression model parameters. We compare our method with state-of-the-art methods, showing that it has equally good predictive performance and more interpretable results. Our approach balances accuracy against interpretability, a current important concern in statistics and machine learning.
Source URL: