BAYESIAN ANALYSIS FOR IMBALANCED POSITIVE-UNLABELLED DIAGNOSIS CODES IN ELECTRONIC HEALTH RECORDS
Publication type:
Article
Authors:
Wang, Ru; Liang, Ye; Miao, Zhuqi; Liu, Tieming
Affiliations:
Oklahoma State University System; Oklahoma State University - Stillwater; State University of New York (SUNY) System; SUNY New Paltz; Oklahoma State University System; Oklahoma State University - Stillwater
Journal:
ANNALS OF APPLIED STATISTICS
ISSN/ISBN:
1932-6157
DOI:
10.1214/22-AOAS1666
Publication date:
2023
Pages:
1220-1238
Keywords:
major risk-factors
diabetic-retinopathy
global prevalence
Mixture Model
components
Abstract:
With the increasing availability of electronic health records (EHR), significant progress has been made on developing predictive inference and algorithms by health-data analysts and researchers. However, the EHR data are notoriously noisy, due to missing and inaccurate inputs, despite abundant information. One serious problem is that only a small portion of patients in the database has confirmatory diagnoses, while many other patients remain undiagnosed because they did not comply with the recommended examinations. The phenomenon leads to a so-called positive-unlabelled situation, and the labels are extremely imbalanced. In this paper we propose a model-based approach to classify the unlabelled patients by using a Bayesian finite mixture model. We also discuss the label switching issue for the imbalanced data and propose a consensus Monte Carlo approach to address the imbalance issue and improve computational efficiency simultaneously. Simulation studies show that our proposed model-based approach outperforms existing positive-unlabelled learning algorithms. The proposed method is applied on the Cerner EHR for detecting diabetic retinopathy (DR) patients using laboratory measurements. With only 3% confirmatory diagnoses in the EHR database, we estimate the actual DR prevalence to be 25%, which coincides with reported findings in the medical literature.
Source URL: