A PRIVACY-PRESERVED AND HIGH-UTILITY SYNTHESIS STRATEGY FOR RISK-BASED STRATIFIED SUBGROUPS OF THE CANADIAN SCLERODERMA PATIENT REGISTRY DATA

Publication type:
Article
Authors:
Jiang, Bei; Raftery, Adrian E.; Steele, Russell J.; Wang, Naisyin
Affiliations:
University of Alberta; University of Washington; University of Washington Seattle; McGill University; University of Michigan System; University of Michigan
Journal:
ANNALS OF APPLIED STATISTICS
ISSN/ISBN:
1932-6157
DOI:
10.1214/24-AOAS2009
Publication date:
2025
Pages:
1240-1269
Keywords:
disclosure
Abstract:
Responsible data sharing anchors research reproducibility and promotes the integrity of scientific research. Motivated by Canadian Scleroderma Research Group (CSRG) patient registry data, we present a risk-based method to produce privacy-preserved and high-utility synthetic datasets, which also simultaneously imputes missing data of mixed continuous and categorical types in the original dataset. This method divides all individuals into different subgroups, based on their reidentification risks, and provides tailored synthesis strategies targeted for each risk subgroup, through the associated tuning mechanisms. Under our setting, our risk-based method reduced the number of patients at risk from 198 to four, among the 691 CSRG patients who have no missing values in any of the quasi-identifying variables, while preserving all correct inferential conclusions in the target analysis. The 95% confidence intervals (CIs) have 92.6% overlap, on average, with the CIs constructed using the unperturbed imputation-completed datasets. These findings suggest that our risk-based method makes it possible to release complete synthetic datasets for research reproducibility while ensuring that the reidentification risks are acceptably low. In contrast, the existing one-size-fits-all synthesis strategies that do not take account of different risk levels can lead to unnecessary information loss and possibly incorrect scientific conclusions.
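The abstract reports utility as the average overlap between 95% confidence intervals from the synthetic and the imputation-completed data. As a rough illustration only, the sketch below computes one commonly used interval-overlap measure: the mean of the shared length expressed as a fraction of each interval's own length. The abstract does not state the exact formula the authors used, so this definition is an assumption; the function name `ci_overlap` and the example intervals are hypothetical.

```python
def ci_overlap(orig, synth):
    """Average fractional overlap of two intervals given as (lower, upper).

    Assumed definition (not confirmed by the abstract): the length of the
    intersection divided by each interval's own length, averaged over the two.
    """
    lo_o, hi_o = orig
    lo_s, hi_s = synth
    # Length of the intersection; negative when the intervals are disjoint.
    inter = min(hi_o, hi_s) - max(lo_o, lo_s)
    return 0.5 * (inter / (hi_o - lo_o) + inter / (hi_s - lo_s))


# Illustrative numbers only, not taken from the CSRG analysis.
identical = ci_overlap((0.0, 1.0), (0.0, 1.0))  # same interval
shifted = ci_overlap((0.0, 2.0), (1.0, 3.0))    # half of each interval is shared
```

Under this definition, identical intervals score 1 and the score falls as the synthetic interval drifts away from the original one. Values are left unclipped here, so disjoint intervals yield a negative score; some variants truncate at zero instead.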