Assessing Identification Risk in Survey Microdata Using Log-Linear Models

Document type:
Article
Authors:
Skinner, Chris; Shlomo, Natalie
Affiliation:
University of Southampton
Journal:
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
ISSN/ISBN:
0162-1459
DOI:
10.1198/016214507000001328
Publication date:
2008
Pages:
989-1001
Keywords:
disclosure risk
Abstract:
This article considers the assessment of the risk of identification of respondents in survey microdata, in the context of applications at the United Kingdom (UK) Office for National Statistics (ONS). The threat comes from the matching of categorical key variables between microdata records and external data sources and from the use of log-linear models to facilitate matching. While the potential use of such statistical models is well established in the literature, little consideration has been given to model specification or to the sensitivity of risk assessment to this specification. In numerical work not reported here, we have found that standard techniques for selecting log-linear models, such as chi-squared goodness-of-fit tests, provide little guidance regarding the accuracy of risk estimation for the very sparse tables generated by typical applications at ONS, for example, tables with millions of cells formed by cross-classifying six key variables, with sample sizes of 10,000 or 100,000. In this article we develop new criteria for assessing the specification of a log-linear model in relation to the accuracy of risk estimates. We find that, within a class of reasonable models, risk estimates tend to decrease as the complexity of the model increases. We develop criteria that detect underfitting (associated with overestimation of the risk). The criteria may also reveal overfitting (associated with underestimation), although not so clearly, so we suggest employing a forward model selection approach. Our criteria turn out to be related to established methods of testing for overdispersion in Poisson log-linear models. We show how our approach may be used for file-level and record-level measures of risk. We evaluate the proposed procedures using samples drawn from the 2001 UK Census where the true risks can be determined and show that a forward selection approach leads to good risk estimates.
There are several good models between which our approach provides little discrimination. The risk estimates are found to be stable across these models, implying a form of robustness. We also apply our approach to a large survey dataset. There is no indication that increasing the sample size necessarily leads to the selection of a more complex model. The risk estimates for this application display more variation but suggest a suitable upper bound.
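
As an illustration of what record-level and file-level risk measures can look like under a Poisson model, the following is a minimal sketch and not the authors' implementation. It assumes, beyond what the abstract states, that population cell counts F are Poisson with mean lam, that units are sampled independently with fraction pi, and that risk is evaluated only for sample-unique cells (sample count f = 1). Under those assumptions F - 1 given f = 1 is Poisson with mean lam * (1 - pi). The function names and the inputs are hypothetical; in practice lam would come from a fitted log-linear model.

```python
import math


def record_level_risks(lam, pi):
    """Risk for one sample-unique cell (hypothetical helper).

    lam: assumed population mean of the cell count (from some fitted model).
    pi:  assumed sampling fraction, 0 < pi <= 1.
    Returns (P(F = 1 | f = 1), E(1/F | f = 1)) under the Poisson assumption.
    """
    mu = lam * (1.0 - pi)  # expected number of unsampled units in the cell
    p_unique = math.exp(-mu)  # chance the sample unique is population unique
    # E[1 / (1 + X)] for X ~ Poisson(mu) equals (1 - exp(-mu)) / mu; limit is 1 at mu = 0
    match_prob = 1.0 if mu == 0.0 else -math.expm1(-mu) / mu
    return p_unique, match_prob


def file_level_risks(lams, pi):
    """Sum the record-level risks over all sample-unique cells."""
    pairs = [record_level_risks(lam, pi) for lam in lams]
    tau1 = sum(p for p, _ in pairs)  # expected number of population uniques
    tau2 = sum(m for _, m in pairs)  # expected number of correct matches
    return tau1, tau2


if __name__ == "__main__":
    # Made-up fitted means for three sample-unique cells, 10% sampling fraction
    print(file_level_risks([0.5, 2.0, 10.0], pi=0.1))
```

Because both measures depend on the fitted means only through lam * (1 - pi), larger fitted means give smaller risks, so the choice of model that produces those means drives the risk estimate.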