Data Quality - Faculty of Computer Science

Download Report

Transcript Data Quality - Faculty of Computer Science

Data Quality
Simon Razniewski
In collaboration with Werner Nutt
Free University of Bozen-Bolzano
Introduction
• Simon Razniewski
PhD Student at the FUB
– Data quality
– Data completeness
• Werner Nutt
Professor in Computer Science at the FUB
Focus in research and teaching:
– Data management, data modelling
– Data integration
– Incomplete information
Data Quality
10.8.2011 - EURAC
2
Why data quality?
• Data are the basis for (scientific) conclusions about
the world
• Conclusions only as good as the data they are based on
• Low-quality data  low-quality conclusions
Data Quality
10.8.2011 - EURAC
3
Some effects of erroneous data are funny
Man invited for pre-natal check
Data Quality
10.8.2011 - EURAC
4
Some data errors are long-living
Spinach contains much iron
100g of spinach
contain 35mg of
iron
Gustav v. Bunge 1890
100g spinach
contain only
3,5mg of iron
Data Quality
10.8.2011 - EURAC
5
Some data errors are mysterious
Student records in in Georgia (USA), 2009
??
19.000 students leave their school to change to another
… but arrive nowhere
Data Quality
10.8.2011 - EURAC
6
Overview
• What are data used for?
– Data model the real world
• What can go wrong?
– Wrong, outdated, missing data
• What can one do for
– Correctness
– Currency
– Completeness of data?
Data Quality
10.8.2011 - EURAC
7
Data model the real world
Real world:
Students, teachers, classes
Database:
Tables
HOB Bozen
Class 2A
Anna
Paul
Diego
Maria
We analyze the data (instead of the real world)
and draw (scientific) conclusions
 data determines our conclusions
Data Quality
10.8.2011 - EURAC
8
Questions about students
• „How many students are there in the class 2A of the HOB
Bozen?“
• „What is the average age of the students of this class?“
• „How many students play an instrument?“
Data Quality
10.8.2011 - EURAC
9
Table „Students“
Name
Date of birth
School
Class
Paul
7.4.1995
HOB Bolzen
2A
Anna
3.8.1959
HOB Bozen
1A
Diego
?
HOB Meran
2A
What is the average age of the students of
the class 2A of the HOB Bozen?
Data Quality
10.8.2011 - EURAC
10
Many things can go wrong
Name
Date of birth
School
Class
Paul
7.4.1995
HOB Bolzen
2A
Anna
3.8.1959
HOB Bozen
1A
Diego
?
HOB Meran
2A
What is the average age of the students of
the class 2A of the HOB Bozen?
Data Quality
10.8.2011 - EURAC
11
Typos
Name
Date of birth
School
Class
Paul
7.4.1995
HOB Bolzen
2A
Anna
3.8.1959
HOB Bozen
1A
Diego
?
HOB Meran
2A
date of birth of Anna
school of Paul
Data Quality
10.8.2011 - EURAC
12
Factual errors
Name
Date of birth
School
Class
Paul
7.4.1995
HOB Bolzen
2A
Anna
3.8.1959
HOB Bozen
1A
Diego
?
HOB Meran
2A
school of Diego (“HOB Meran“ instead of “HOB Bozen“)
Data Quality
10.8.2011 - EURAC
13
Outdated entries
Name
Date of birth
School
Class
Paul
7.4.1995
HOB Bolzen
2A
Anna
3.8.1959
HOB Bozen
1A
Diego
?
HOB Meran
2A
class of Anna (“1A“ instead of “2A“)
Data Quality
10.8.2011 - EURAC
14
Missing values
Name
Date of birth
School
Class
Paul
7.4.1995
HOB Bolzen
2A
Anna
3.8.1959
HOB Bozen
1A
Diego
?
HOB Meran
2A
date of birth of Diego (“Null value“)
Data Quality
10.8.2011 - EURAC
15
Missing records
Name
Date of birth
School
Class
Paul
7.4.1995
HOB Bolzen
2A
Anna
3.8.1959
HOB Bozen
1A
Diego
?
HOB Meran
2A
Maria
12.10.1995
HOB Bozen
2A
the record about Maria is missing
Data Quality
10.8.2011 - EURAC
16
Missing concepts
Name
Date of birth
School
Class
Instrument
Paul
7.4.1995
HOB Bolzen
2A
Cello
Anna
3.8.1959
HOB Bozen
1A
?
Diego
?
HOB Meran
2A
?
no possibility to store information
about music instruments
Data Quality
10.8.2011 - EURAC
17
What can be done?
There is a distinction between different
dimensions of data quality
The most important ones are
• Correctness
Does the data match the real world?
• Timeliness
Is the data up-to-date?
• Completeness
Are all aspects of the domain of interest captured?
Further: Comprehensibility, accessability, …
Datenqualität
10.8.2011 - EURAC
18
Dimension 1: Correctness
IT-techniques:
1. Detecting typos or statistical outliers
students born in 1959
2. Recognizing duplicates
Mohammad Al Zaïn = Muhamad Alzain
3. Rules for logical consistency
no student can visit two schools at the same time
Organisation:
Special treatment of core data: Master data management
For example: students, teachers, schools
Datenqualität
10.8.2011 - EURAC
19
Dimension 2: Timeliness
• By workflow organisation:
Bind workflows onto the IT system
 Timeliness is guaranteed
Example: an enrolment is only valid
if it is recorded in the database
• Trough data about the currency of the data (metadata)
 Timeliness can be estimated
Example: “All dropouts until 31th of March are recorded“
Datenqualität
10.8.2011 - EURAC
20
Dimension 3/1: Completeness of values
• Can be enforced by the IT system
Risk: nonsensical entries
• Alternative solution: enforce input of less values
 Record reasons for missing values
E.g. “Not applicable” or “Unknown”
Datenqualität
10.8.2011 - EURAC
21
Dimension 3/3: Conceptual completeness
• Solid design is important,
but not everything can be foreseen
• Flexible IT: Schema changes if necessary
– Space for comments, additional information
• Otherwise: Other fields will be abused
Example: Gasworks in the USA
Warning of dogs for meter-readers
Address Mountain Road 102
(Beware of dog)
… later they send bills
Datenqualität
10.8.2011 - EURAC
22
Dimension 3/2: Table completeness
• Events are completely recorded,
if they are bound to the IT system
Example: Sales in a supermarket
• In general, this binding is not possible
 only parts of the database tables are complete
• But: Completeness is only necessary for specific uses
Example: school statistics from ASTAT
 Research
Datenqualität
10.8.2011 - EURAC
23
Partial table completeness
• Common scenario: We have
– Some, but not all data complete
– Questions (‘‘queries“) over data
• Problems:
– Do we have the data that is needed to answer
the queries?
If not:
– What more data do we need?
Data Quality
10.8.2011 - EURAC
24
An (intuitive) example
• Suppose we have data about all students from
– Italian schools
– German schools, except of the primary school ‘‘Andreas Hofer“
– Ladin schools, except of the high school “Gherdëna“
• Can we correctly answer questions about the italian
students in South Tyrol?
 Yes, because we have all data about students from
italian schools
Data Quality
10.8.2011 - EURAC
25
An (intuitive) example (2)
• Suppose we have data about all students from
– Italian schools
– German schools, except of the primary school ‘‘Andreas Hofer“
– Ladin schools, except of the high school “Gherdëna“
• Can we answer questions about the high-school students
in South Tyrol?
 No, because data from the “Gherdëna“ high school is missing
 We could bug them to submit their data (but maybe the secretary is
on holiday)
 We could ask someone else for the data, e.g., the local district administration
Data Quality
10.8.2011 - EURAC
26
Our research
• How can one describe that data is complete to a certain
extent?
• How can one find out whether the data one has is
sufficient for a certain use?
• How can one find out which data is necessary to serve
a certain use?
Data Quality
10.8.2011 - EURAC
27
Formal example
How many students attend an Italian school?
SELECT count(*)
FROM student, school
WHERE student.school = school.name AND
school.language = ‘italian‘;
Suppose, we have all Italian students.
Can we answer this query completely?
Data Quality
10.8.2011 - EURAC
28
How can we formalize table completeness?
“We have all students from italian schools“
• We imagine: an ideal database that contains
complete information about the world
• Completeness statements refer to this ideal database:
=> All ideal students from Italian schools
occur among our real students
Data Quality
10.8.2011 - EURAC
29
How can we assert
(partial) completeness of tables?
“We have all students from italian schools“
Table completeness assertion:
real.student CONTAINS(
SELECT ideal.student.*
FROM
ideal.student, ideal.school
WHERE ideal.student.school = ideal.school.name
AND ideal.school.language = ‘Italian‘)
Table completeness assertions constitute a
logical theory about real and ideal database
Data Quality
10.8.2011 - EURAC
30
What does it mean that our query is complete?
Consider two versions:
“Real query“
“Ideal query“
SELECT count(*)
FROM real.student, real.school
WHERE
real.student.school = real.school.name
AND
real.school.language = ‘italian‘;
SELECT count(*)
FROM ideal.student, ideal.school
WHERE
ideal.student.school = ideal.school.name
AND
ideal.school.language = ‘italian‘;
Our query is complete if
the real and the ideal query return the same results
(Can be expressed in logic, too)
=> Reasoning
Data Quality
10.8.2011 - EURAC
31
Our results so far
• Formalization
• General reasoning procedures for
– Single block SQL queries
– With comparisons
– Group By
– Aggregate functions min, max, count, sum
• Complexity analysis (sometimes high!)
• Architecture for reasoning system
• “Inverse reasoning” (see later slide)
This is a start, many things are still missing
Data Quality
10.8.2011 - EURAC
32
Reasoning with schema information
• To draw interesting inferences, we need to take into
account
– Keys
– Foreign keys
– Finite domains
~> Reasoning becomes more complicated
(Current research)
Data Quality
10.8.2011 - EURAC
33
Inverse reasoning
• So far:
Given: Assertions about table completeness
Question: Can query Q be answered completely?
• Also interesting:
Given: query Q
Question: which are the minimal completeness
assertions that assure completeness of Q?
• Can be answered by applying our inference methods
backwards
Data Quality
10.8.2011 - EURAC
34
Perspective: Probabilistic completeness
management
• Our theory so far:
Boolean statements: complete/not complete
• In practice, it is often sufficient to know
“With probability < p, we make an error < ε“
• Probabilistic assertions:
“With 90% probability, we are not missing more than 5 students“
=> Probabilistic inferences
Data Quality
10.8.2011 - EURAC
35
Conclusion
• Data quality has several dimensions
– Correctness, timeliness, completeness
• Our current interest
– How can one describe which data are complete
– How can one find out which queries can be answered
completely?
– If not, which additional data is needed?
• Perspective: Probabilistic completeness management
Data Quality
10.8.2011 - EURAC
36