PARTITIONING AROUND MEDOIDS CLUSTERING AND RANDOM FOREST CLASSIFICATION FOR GIS-INFORMED IMPUTATION OF FLUORIDE CONCENTRATION DATA

成果类型:
Article
署名作者:
Gu, Yu; Preisser, John S.; Zeng, Donglin; Shrestha, Poojan; Shah, Molina; Simancas-Pallares, Miguel A.; Ginnis, Jeannie; Divaris, Kimon
署名单位:
University of North Carolina; University of North Carolina Chapel Hill; University of North Carolina School of Medicine; University of North Carolina; University of North Carolina Chapel Hill; University of North Carolina School of Medicine; University of North Carolina; University of North Carolina Chapel Hill; University of North Carolina School of Medicine
刊物名称:
ANNALS OF APPLIED STATISTICS
ISSN/ISSBN:
1932-6157
DOI:
10.1214/21-AOAS1516
发表日期:
2022
页码:
551-572
关键词:
oral-health interpolation
摘要:
Community water fluoridation is an important component of oral health promotion, as fluoride exposure is a well-documented dental caries-preventive agent. Direct measurements of domestic water fluoride content provide valuable information regarding individuals' fluoride exposure and thus caries risk; however, they are logistically challenging to carry out at a large scale in oral health research. This article describes the development and evaluation of a novel method for the imputation of missing domestic water fluoride concentration data informed by spatial autocorrelation. The context is a state-wide epidemiologic study of pediatric oral health in North Carolina, where domestic water fluoride concentration information was missing for approximately 75% of study participants with clinical data on dental caries. A new machine-learning-based imputation method that combines partitioning around medoids clustering and random forest classification (PAMRF) is developed and implemented. Imputed values are filtered according to allowable error rates or target sample size, depending on the requirements of each application. In leave-one-out cross-validation and simulation studies, PAMRF outperforms four existing imputation approaches-two conventional spatial interpolation methods (i.e., inverse-distance weighting, IDW and universal kriging, UK) and two supervised learning methods (k-nearest neighbors, KNN, and classification and regression trees, CART). The inclusion of multiply imputed values in the estimation of the association between fluoride concentration and dental caries prevalence resulted in essentially no change in PAMRF estimates but substantial gains in precision due to larger effective sample size. PAMRF is a powerful new method for the imputation of missing fluoride values where geographical information exists.
来源URL: