A framework for reconciling attribute values from multiple data sources

Document type:

Article

Authors:

Jiang, Zhengrui; Sarkar, Sumit; De, Prabuddha; Dey, Debabrata

Affiliations:

University of Texas System; University of Texas Dallas; Purdue University System; Purdue University; University of Washington; University of Washington Seattle

Journal:

MANAGEMENT SCIENCE

ISSN/ISBN:

0025-1909

DOI:

10.1287/mnsc.1070.0745

Publication date:

2007

Pages:

1946-1963

Keywords:

data integration; heterogeneous databases; probabilistic databases; data quality; type I error; type II error; misrepresentation error

Abstract:

Because of the heterogeneous nature of different data sources, data integration is often one of the most challenging tasks in managing modern information systems. While the existing literature has focused on problems such as schema integration and entity identification, it has largely overlooked a basic question: When an attribute value for a real-world entity is recorded differently in different databases, how should the "best" value be chosen from the set of possible values? This paper provides an answer to this question. We first show how a probability distribution over a set of possible values can be derived. We then demonstrate how these probabilities can be used to solve a given decision problem by minimizing the total cost of type I, type II, and misrepresentation errors. Finally, we propose a framework for integrating multiple data sources when a single "best" value has to be chosen and stored for every attribute of an entity.