LN22 - WSU EECS
Download
Report
Transcript LN22 - WSU EECS
CPT-S 580-06
Advanced Databases
Yinghui Wu
EME 49
1
Advanced Databases
Data quality management: An introduction
The data quality problems
Central aspects of data quality
–
–
–
–
–
Data consistency
Data accuracy
Entity resolution (record matching)
Information completeness
Data currency
A real-life encounter
Mr. Smith, our database records indicate that you owe us an
outstanding amount of £5,921 for council tax for 2007
NI#
name
AC
phone
street
city
zip
…
…
…
…
…
…
…
SC35621422
M. Smith
131
3456789
Crichton
EDI
EH8 9LE
SC35621422
M. Smith 020
6728593
Baker
LDN
NW1 6XE
Mr. Smith already moved to London in 2006
The council database had not been correctly updated
– both old address and the new one are in the database
50% of bills have errors (phone bill reviews, 1992)
3
Customer records
country
AC
phone
street
city
zip
44
131
1234567
Mayfield
New York
EH8 9LE
44
131
3456789
Crichton
New York
EH8 9LE
01
908
3456789
Mountain Ave
New York
07974
Anything wrong?
New York City is moved to the UK (country code: 44)
Murray Hill (01-908) in New Jersey is moved to New York state
Error rates: 10% - 75% (telecommunication)
4
Another example
T.Das|97336o8327|24.95|Y|-|0.0|1000
Ted J.|973-360-8779|2000|N|M|NY|1000
Can we interpret the data?
– What do the fields mean?
– What is the key? The measures?
Data glitches
– Typos, multiple formats, missing / default values
Metadata and domain expertise
– Field three is Revenue. In dollars or cents?
– Field seven is Usage. Is it censored?
• Field 4 is a censored flag. How to handle censored data?
Data in real-life are often dirty
Pentagon asked
200+ dead officers
to re-enlist
81 million National Insurance
numbers but only 60 million
eligible citizens
Data quality problems are expensive
hundreds of billion $$$ each year.
98000 deaths cost
each
Resolving data quality problems is
year, caused by errors
often the biggest effort in databases
in medical data research
• In a 500,000 customer database,500,000
120,000dead
customer
records
people
become invalid within 12 months retain active Medicare
• Data error rates in industry: 1% - cards
30% (Redman, 1998)
Dirty data: inconsistent, inaccurate, incomplete, stale
6
The need for data quality tools
Manual effort: beyond reach in practice
Data quality tools: to help automatically
Reasoning
Discover rules
Repair
Editing a sample of census
Detect
errors
data
easily took dozens of
clerks months (Winkler 04,
US Census Bureau)
The market for data quality tools is growing at 17% annually
>> the 7% average of other IT segments
2006
7
Conventional Definition of Data Quality
Consistency
– The data agrees with itself.
Accuracy
– The data was recorded correctly
Uniqueness
– Entities are recorded once.
Completeness
– All relevant data was recorded.
Timeliness
– The data is kept up to date.
• Special problems in federated data: time consistency.
Challenges
Unmeasurable
– Accuracy and completeness are extremely difficult, perhaps
impossible to measure.
Context independent
– No accounting for what is important. E.g., if you are
computing aggregates, you can tolerate a lot of inaccuracy.
Incomplete
– interpretability, accessibility, metadata, analysis, etc.
Vague
– The conventional definitions provide no guidance towards
practical improvements of the data.
Central aspects of data quality
Data consistency
Data accuracy
Entity resolution (record matching)
Information completeness
Data currency
10
Data consistency
11
Inconsistencies of data
[country = 44, zip] [street]
In the UK, zip code uniquely determines the street
country
area-code
phone
street
city
zip
44
131
1234567
Mayfield
NYC
EH8 9LE
44
131
3456789
Crichton
NYC
EH8 9LE
01
908
3456789
Mountain Ave
NYC
07974
Observation
Do not hold on the entire relation (e.g., customers in the US), but on
tuples representing the UK customers only
Data dependencies over a fraction of tuples
12
Inconsistent relation
type(Reagan, president) [10]
married(Reagan, Davis) [10]
married(Elvis,Priscilla)
[10]
spouse(X,Y) & spouse(X,Z) => Y=Z [10]
occurs("Hermione","loves","Harry") [3]
means("Ron",RonaldReagan) [3]
means("Ron",RonaldWeasley) [2]
...
Possible World 1:
married
We are given a set of facts, and
wish to find the most plausible
possible world.
(F. Suchanek et al.: WWW‘09)
occurs(X,P,Y) & means(X,X') & means(Y,Y') & R(X',Y')
P~R
occurs(X,P,Y) & means(X,X') & means(Y,Y') & P~R
R(X',Y')
spouse(X,Y) & spouse(X,Z) Y=Z
Possible World 2:
married
Data Accuracy
CC
AC
FN
LN
age
44
131
Mark
Smith
65
19 is more
accurate
occupation
city
student (HS)
EDI
zip
EH8 9LE
Consistency rule: age < 120. The record is consistent. Is it accurate?
Context: if we know Mark is going to high school, then age 19
[occupation = student (HS)] [age 19]
A uniform framework for improving data consistency and accuracy
Entity resolution
15
Entity resolution (Record matching)
To identify records from unreliable data sources that refer to
the same real-world entity
FN
Mark
LN
address
Smith 10 Oak St, EDI, EH8 9LE
tel
DOB
gender
3256777
10/27/97
M
the same person?
FN
LN
post
phn
when
M.
Smith
10 Oak St, EDI, EH8 9LE
null
1pm/7/7/09
EDI
$3,500
…
…
…
…
…
…
…
NYC
$6,300
Max Smith
PO Box 25, EDI
3256777 2pm/7/7/09
where amount
Record linkage, entity resolution, data deduplication, merge/purge, …
16
Knowledge graph construction
Dubya
0.9
George W.
Bush
0.9
0.6
0.6
0.6
Bush
0.8
President Bush
0.4
0.3
President Bill
Clinton
Bill Clinton
Hillary Clinton
George H. W.
Bush
17
Data completeness
TDD (ln6)
18
Incomplete information: a central data quality issue
A database D of UK patients: patient (name, street, city, zip, YoB)
A simple query Q1: Find the streets of those patients who
were born in 2000 (YoB), and
live in Edinburgh (Edi) with zip = “EH8 9AB”.
Can we trust the query to find complete & accurate information?
Both tuples and values may be missing from D!
“information perceived as being needed for clinical decisions
was unavailable 13.6%--81% of the time” (2005)
19
Traditional approaches: The CWA vs. the OWA
Real world
The Closed World Assumption (CWA)
– all the real-world objects are already
represented by tuples in the database
– missing values only
The Open World Assumption (OWA)
– the database is a subset of the tuples
representing real-world objects
– missing tuples and missing values
database
Real world
database
Few queries can find a complete answer under the OWA
Misleading analysis, biased decisions, disastrous consequences
(enterprises, finance, scientific,TDD
administrative,
medical, … )
(ln6)
20
In real-world applications
Master data (reference data): a consistent and complete repository
of the core business entities of an enterprise (certain categories)
OWA
Master
data
The CWA: the master data – an upper bound of the part constrained
The OWA: the part not covered by the master data
Databases in real world are often
neither entirely closed-world, nor entirely open-world
21
Relative information completeness
Partially closed databases: partially constrained by master data;
neither CWA nor OWA
Relative completeness: a partially closed database that has
complete information to answer a query relative to master data
The completeness and consistency taken together: containment
constraints
Fundamental problems:
– Given a partially closed database D, master data Dm, and a
Matching
complexity
bounds
are to Dm
query Q, decide whether
D is complete
Q for
relatively
in place;
efficient
– Given master data Dmalready
and a query
Q, but
decide
whether there
exists a partially closed
databaseremain
D that to
is be
complete
for Q
algorithms
developed
relatively to Dm
A theory of relative information completeness
22
Data currency
23
Data currency: another central data quality issue
Data currency: the state of the data being current
Data get obsolete quickly: “In a customer file, within two years
about 50% of record may become obsolete” (2002)
Multiple values pertaining to the same entity are present
The values were once correct, but they have become stale and
inaccurate
Reliable timestamps are often not available
Identifying stale data is
costly and difficult
How can we tell when the data are current or stale?
24
Determining the currency of data
FN
LN
address
salary
status
Mary
Smith
2 Small St
50k
single
Mary
Dupont
10 Elm St
50k
married
Mary
Dupont
6 Main St
80k
married
Bob
Luth
8 Cowan St
80k
married
Robert
Luth
6 Drum St
55k
married
Entities:
Mary
Robert Identified via record matching
Q1: what is Mary’s current salary?
80k
Temporal constraint: salary is monotonically increasing
Determining data currency in theTDD
absence
of timestamps
(ln6)
25
Temporal constraints
FN
LN
address
salary
status
Mary
Smith
2 Small St
50k
single
Mary
Dupont
10 Elm St
50k
married
Mary
Dupont
6 Main St
80k
married
Bob
Luth
8 Cowan St
80k
married
Robert
Luth
6 Drum St
55k
married
Q2: what is Mary’s current last name? Dupont
Temporal constraints:
•
•
Marital status only changes from single married divorced
Tuples with the most current marital status also have the
most current last name
TDD (ln6)
Temporal constraints to specify the
currency of correlated attributes
26
Data currency
Data currency model:
•
Partial temporal orders, temporal constraints, copy functions
Certain current query answering: answering queries with the
current values of entities (over all possible “consistent
completions” of the partial temporal orders)
Fundamental problems: Given partial temporal orders, temporal
constraints and copy functions, to decide
– whether a value is certainly more current than another?
– whether a tuple is a certainFundamental
current answer
to a query?
problems
have been
– whether the copy functionsstudied;
are currency
preserving?
but efficient
algorithms are
No matter how the copy functions
extended, the certain
not yet inare
place
currency answers remain unchanged
There is much more to be doneTDD (ln6)
27
Dependencies: A promising approach
Errors found in practice
– Syntactic: a value not in the corresponding domain or range,
e.g., name = 1.23, age = 250
– Semantic: a value representing a real-world entity different from the
true value of the entity, e.g., CIA found WMD in Iraq
Dependencies: for specifying the semantics of relational data
– relation
(table):
a setand
of tuples
(records)
Hard
to detect
fix
NI#
name
AC
phone
street
city
zip
SC35621422
M. Smith
131
3456789
Crichton
EDI
EH8 9LE
SC35621422
M. Smith
020
6728593
Baker
LDN
NW1 6XE
How can dependencies help?
28
Functional Dependencies (FDs)
NI# street, city, zip
NI# determines address: for any two records, if they have the
same NI#, then they must have the same address
for each distinct NI#, there is a unique current address
NI#
name
AC
phone
street
city
zip
SC35621422
M. Smith
131
3456789
Crichton
EDI
EH8 9LE
SC35621422
M. Smith
020
6728593
Baker
LDN
NW1 6XE
for SC35621422, at least one of the addresses is not up to date
Functional dependencies help detect errors in a single relation
29
Inclusion Dependencies (INDs)
book[asin, title, price] item[asin, title, price]
book
item
asin
isbn
title
price
a23
b32
Harry Potter
17.99
a56
b65
Snow white
7.94
asin
title
type
price
a23
Harry Potter
book
17.99
a12
J. Denver
CD
7.94
Any book sold by a store must be an item carried by the store
– for any book tuple, there must exist an item tuple such that their
asin, title and price attributes pairwise agree with each other
Inclusion dependencies help detect errors across relations
30
More reasons for using dependencies
Capturing a fundamental part of the semantics of data
– errors and inconsistencies as violations of dependencies
Techniques and inference systems are in place for reasoning
about dependencies (data quality rules)
– remove redundant rules
– identify “dirty” rules
– …
Profiling
– automated discovery of dependencies (data quality rules)
...
Dependencies should logically become part of data cleaning process
31
However, …
Dependencies were developed for improving the quality of schema
country
area-code
phone
street
city
zip
44
131
1234567
Mayfield
NYC
EH8 9LE
44
131
3456789
Crichton
NYC
EH8 9LE
01
908
3456789
Mountain Ave
NYC
07974
functional dependencies (FDs)
country, area-code, phone street, city, zip
country, area-code city
The database satisfies the FDs, but the data is not clean!
A central problem is how to tell whether the data is dirty or clean
32
Summary
Data quality: The No.1 problem for data management
Real life data are dirty, dirty data are costly
– The quest for a principled approach
Many challenges remain
–
–
–
–
–
Effective algorithms for certain fixes (minimum user interaction)
Efficient algorithms for determining information completeness
Efficient algorithms for deciding data currency
Data accuracy
Putting all together: Interaction between central issues of data
quality
telecommunication, life sciences, finance, e-government, …
Data quality: A rich source of questions and vitality
33