Improving Imbalanced Machine Learning with Neighborhood-Informed Synthetic Sample Placement

Document type:
Article
Authors:
Nasir, Murtaza; Dag, Ali; Simsek, Serhat; Ivanov, Anton; Oztekin, Asil
Affiliations:
Wichita State University; Creighton University; Montclair State University; University of Illinois System; University of Illinois Urbana-Champaign; University of Massachusetts System; University of Massachusetts Lowell
Journal:
JOURNAL OF MANAGEMENT INFORMATION SYSTEMS
ISSN/ISBN:
0742-1222
DOI:
10.1080/07421222.2022.2127453
Publication date:
2022
Pages:
1116-1145
Keywords:
data analytic approach; Social media; predictive analytics; CLASSIFICATION; identification; platforms; systems; IMPACT; smote; care
Abstract:
Machine learning is widely used in information systems design. Yet, training algorithms on imbalanced datasets may severely degrade performance on unseen data. For example, in some healthcare, fintech, or cybersecurity contexts, certain subclasses are difficult to learn because they are underrepresented in training data. Our study offers a flexible and efficient solution based on a new synthetic average neighborhood sampling algorithm (SANSA), which, in contrast to other solutions, introduces a novel placement parameter that can be tuned to adapt to each dataset's unique manifestation of the imbalance. The package is available for download for R. We tested SANSA against seven existing sampling methods used in conjunction with the four most frequently used machine learning models trained on 14 benchmark datasets. Our results provide suggestive evidence that SANSA offers a feasible solution to the imbalance problem for most datasets. Our findings provide practical recommendations for how SANSA can be effectively implemented while reducing the complexity level of an imbalanced learning pipeline.
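
The abstract does not spell out how SANSA places its synthetic samples, so the following is only a rough illustration of the general idea of neighborhood-informed placement with a tunable placement parameter. It is not the published SANSA procedure and not the R package's API. The function name, the `placement` argument, and the rule itself (move a minority sample toward the average of its k nearest minority neighbors) are all assumptions made for this sketch.

```python
import numpy as np

def neighborhood_average_oversample(X_min, n_new, k=5, placement=0.5, seed=0):
    """Illustrative only: NOT the published SANSA procedure.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and the mean of its k nearest minority neighbours.
    `placement` (hypothetical name) in [0, 1] sets where on that segment:
    0 keeps the original sample, 1 lands on the neighbourhood mean.
    """
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]  # indices of the k nearest neighbours
    centroids = X_min[nn].mean(axis=1)  # neighbourhood average per sample
    idx = rng.integers(0, n, size=n_new)  # seed samples for the new points
    return X_min[idx] + placement * (centroids[idx] - X_min[idx])

# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = neighborhood_average_oversample(X_min, n_new=6, k=2, placement=0.5)
print(synthetic.shape)  # (6, 2)
```

In this sketch a small `placement` keeps synthetic points close to observed minority samples, while a larger value pulls them toward the local neighborhood average; treating it as a hyperparameter to tune per dataset mirrors the role the abstract describes for the placement parameter.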