Data Quality - Faculty of Computer Science
Download
Report
Transcript Data Quality - Faculty of Computer Science
Data Quality
Simon Razniewski
In collaboration with Werner Nutt
Free University of Bozen-Bolzano
Introduction
• Simon Razniewski
PhD Student at the FUB
– Data quality
– Data completeness
• Werner Nutt
Professor in Computer Science at the FUB
Focus in research and teaching:
– Data management, data modelling
– Data integration
– Incomplete information
Data Quality
10.8.2011 - EURAC
2
Why data quality?
• Data are the basis for (scientific) conclusions about
the world
• Conclusions only as good as the data they are based on
• Low-quality data low-quality conclusions
Data Quality
10.8.2011 - EURAC
3
Some effects of erroneous data are funny
Man invited for pre-natal check
Data Quality
10.8.2011 - EURAC
4
Some data errors are long-living
Spinach contains much iron
100g of spinach
contain 35mg of
iron
Gustav v. Bunge 1890
100g spinach
contain only
3,5mg of iron
Data Quality
10.8.2011 - EURAC
5
Some data errors are mysterious
Student records in in Georgia (USA), 2009
??
19.000 students leave their school to change to another
… but arrive nowhere
Data Quality
10.8.2011 - EURAC
6
Overview
• What are data used for?
– Data model the real world
• What can go wrong?
– Wrong, outdated, missing data
• What can one do for
– Correctness
– Currency
– Completeness of data?
Data Quality
10.8.2011 - EURAC
7
Data model the real world
Real world:
Students, teachers, classes
Database:
Tables
HOB Bozen
Class 2A
Anna
Paul
Diego
Maria
We analyze the data (instead of the real world)
and draw (scientific) conclusions
data determines our conclusions
Data Quality
10.8.2011 - EURAC
8
Questions about students
• „How many students are there in the class 2A of the HOB
Bozen?“
• „What is the average age of the students of this class?“
• „How many students play an instrument?“
Data Quality
10.8.2011 - EURAC
9
Table „Students“
Name
Date of birth
School
Class
Paul
7.4.1995
HOB Bolzen
2A
Anna
3.8.1959
HOB Bozen
1A
Diego
?
HOB Meran
2A
What is the average age of the students of
the class 2A of the HOB Bozen?
Data Quality
10.8.2011 - EURAC
10
Many things can go wrong
Name
Date of birth
School
Class
Paul
7.4.1995
HOB Bolzen
2A
Anna
3.8.1959
HOB Bozen
1A
Diego
?
HOB Meran
2A
What is the average age of the students of
the class 2A of the HOB Bozen?
Data Quality
10.8.2011 - EURAC
11
Typos
Name
Date of birth
School
Class
Paul
7.4.1995
HOB Bolzen
2A
Anna
3.8.1959
HOB Bozen
1A
Diego
?
HOB Meran
2A
date of birth of Anna
school of Paul
Data Quality
10.8.2011 - EURAC
12
Factual errors
Name
Date of birth
School
Class
Paul
7.4.1995
HOB Bolzen
2A
Anna
3.8.1959
HOB Bozen
1A
Diego
?
HOB Meran
2A
school of Diego (“HOB Meran“ instead of “HOB Bozen“)
Data Quality
10.8.2011 - EURAC
13
Outdated entries
Name
Date of birth
School
Class
Paul
7.4.1995
HOB Bolzen
2A
Anna
3.8.1959
HOB Bozen
1A
Diego
?
HOB Meran
2A
class of Anna (“1A“ instead of “2A“)
Data Quality
10.8.2011 - EURAC
14
Missing values
Name
Date of birth
School
Class
Paul
7.4.1995
HOB Bolzen
2A
Anna
3.8.1959
HOB Bozen
1A
Diego
?
HOB Meran
2A
date of birth of Diego (“Null value“)
Data Quality
10.8.2011 - EURAC
15
Missing records
Name
Date of birth
School
Class
Paul
7.4.1995
HOB Bolzen
2A
Anna
3.8.1959
HOB Bozen
1A
Diego
?
HOB Meran
2A
Maria
12.10.1995
HOB Bozen
2A
the record about Maria is missing
Data Quality
10.8.2011 - EURAC
16
Missing concepts
Name
Date of birth
School
Class
Instrument
Paul
7.4.1995
HOB Bolzen
2A
Cello
Anna
3.8.1959
HOB Bozen
1A
?
Diego
?
HOB Meran
2A
?
no possibility to store information
about music instruments
Data Quality
10.8.2011 - EURAC
17
What can be done?
There is a distinction between different
dimensions of data quality
The most important ones are
• Correctness
Does the data match the real world?
• Timeliness
Is the data up-to-date?
• Completeness
Are all aspects of the domain of interest captured?
Further: Comprehensibility, accessability, …
Datenqualität
10.8.2011 - EURAC
18
Dimension 1: Correctness
IT-techniques:
1. Detecting typos or statistical outliers
students born in 1959
2. Recognizing duplicates
Mohammad Al Zaïn = Muhamad Alzain
3. Rules for logical consistency
no student can visit two schools at the same time
Organisation:
Special treatment of core data: Master data management
For example: students, teachers, schools
Datenqualität
10.8.2011 - EURAC
19
Dimension 2: Timeliness
• By workflow organisation:
Bind workflows onto the IT system
Timeliness is guaranteed
Example: an enrolment is only valid
if it is recorded in the database
• Trough data about the currency of the data (metadata)
Timeliness can be estimated
Example: “All dropouts until 31th of March are recorded“
Datenqualität
10.8.2011 - EURAC
20
Dimension 3/1: Completeness of values
• Can be enforced by the IT system
Risk: nonsensical entries
• Alternative solution: enforce input of less values
Record reasons for missing values
E.g. “Not applicable” or “Unknown”
Datenqualität
10.8.2011 - EURAC
21
Dimension 3/3: Conceptual completeness
• Solid design is important,
but not everything can be foreseen
• Flexible IT: Schema changes if necessary
– Space for comments, additional information
• Otherwise: Other fields will be abused
Example: Gasworks in the USA
Warning of dogs for meter-readers
Address Mountain Road 102
(Beware of dog)
… later they send bills
Datenqualität
10.8.2011 - EURAC
22
Dimension 3/2: Table completeness
• Events are completely recorded,
if they are bound to the IT system
Example: Sales in a supermarket
• In general, this binding is not possible
only parts of the database tables are complete
• But: Completeness is only necessary for specific uses
Example: school statistics from ASTAT
Research
Datenqualität
10.8.2011 - EURAC
23
Partial table completeness
• Common scenario: We have
– Some, but not all data complete
– Questions (‘‘queries“) over data
• Problems:
– Do we have the data that is needed to answer
the queries?
If not:
– What more data do we need?
Data Quality
10.8.2011 - EURAC
24
An (intuitive) example
• Suppose we have data about all students from
– Italian schools
– German schools, except of the primary school ‘‘Andreas Hofer“
– Ladin schools, except of the high school “Gherdëna“
• Can we correctly answer questions about the italian
students in South Tyrol?
Yes, because we have all data about students from
italian schools
Data Quality
10.8.2011 - EURAC
25
An (intuitive) example (2)
• Suppose we have data about all students from
– Italian schools
– German schools, except of the primary school ‘‘Andreas Hofer“
– Ladin schools, except of the high school “Gherdëna“
• Can we answer questions about the high-school students
in South Tyrol?
No, because data from the “Gherdëna“ high school is missing
We could bug them to submit their data (but maybe the secretary is
on holiday)
We could ask someone else for the data, e.g., the local district administration
Data Quality
10.8.2011 - EURAC
26
Our research
• How can one describe that data is complete to a certain
extent?
• How can one find out whether the data one has is
sufficient for a certain use?
• How can one find out which data is necessary to serve
a certain use?
Data Quality
10.8.2011 - EURAC
27
Formal example
How many students attend an Italian school?
SELECT count(*)
FROM student, school
WHERE student.school = school.name AND
school.language = ‘italian‘;
Suppose, we have all Italian students.
Can we answer this query completely?
Data Quality
10.8.2011 - EURAC
28
How can we formalize table completeness?
“We have all students from italian schools“
• We imagine: an ideal database that contains
complete information about the world
• Completeness statements refer to this ideal database:
=> All ideal students from Italian schools
occur among our real students
Data Quality
10.8.2011 - EURAC
29
How can we assert
(partial) completeness of tables?
“We have all students from italian schools“
Table completeness assertion:
real.student CONTAINS(
SELECT ideal.student.*
FROM
ideal.student, ideal.school
WHERE ideal.student.school = ideal.school.name
AND ideal.school.language = ‘Italian‘)
Table completeness assertions constitute a
logical theory about real and ideal database
Data Quality
10.8.2011 - EURAC
30
What does it mean that our query is complete?
Consider two versions:
“Real query“
“Ideal query“
SELECT count(*)
FROM real.student, real.school
WHERE
real.student.school = real.school.name
AND
real.school.language = ‘italian‘;
SELECT count(*)
FROM ideal.student, ideal.school
WHERE
ideal.student.school = ideal.school.name
AND
ideal.school.language = ‘italian‘;
Our query is complete if
the real and the ideal query return the same results
(Can be expressed in logic, too)
=> Reasoning
Data Quality
10.8.2011 - EURAC
31
Our results so far
• Formalization
• General reasoning procedures for
– Single block SQL queries
– With comparisons
– Group By
– Aggregate functions min, max, count, sum
• Complexity analysis (sometimes high!)
• Architecture for reasoning system
• “Inverse reasoning” (see later slide)
This is a start, many things are still missing
Data Quality
10.8.2011 - EURAC
32
Reasoning with schema information
• To draw interesting inferences, we need to take into
account
– Keys
– Foreign keys
– Finite domains
~> Reasoning becomes more complicated
(Current research)
Data Quality
10.8.2011 - EURAC
33
Inverse reasoning
• So far:
Given: Assertions about table completeness
Question: Can query Q be answered completely?
• Also interesting:
Given: query Q
Question: which are the minimal completeness
assertions that assure completeness of Q?
• Can be answered by applying our inference methods
backwards
Data Quality
10.8.2011 - EURAC
34
Perspective: Probabilistic completeness
management
• Our theory so far:
Boolean statements: complete/not complete
• In practice, it is often sufficient to know
“With probability < p, we make an error < ε“
• Probabilistic assertions:
“With 90% probability, we are not missing more than 5 students“
=> Probabilistic inferences
Data Quality
10.8.2011 - EURAC
35
Conclusion
• Data quality has several dimensions
– Correctness, timeliness, completeness
• Our current interest
– How can one describe which data are complete
– How can one find out which queries can be answered
completely?
– If not, which additional data is needed?
• Perspective: Probabilistic completeness management
Data Quality
10.8.2011 - EURAC
36