Transcript ppt
Data
Cleaning
An Overview of data quality problems and data cleaning solution approaches
Seminarvortrag: Digital Information Curation
Jens Gerken – 20.12.2005
ZuiScat
Konstanz, 20.12.2005
Jens Gerken
Outline
1 - 12
Motivation
Data Quality Problems
Data Cleaning Process
Typical Conflict Resolution Steps
Data Cleaning Tool: Potter’s Wheel
Conclusion
Konstanz, 20.12.2005
Jens Gerken
Motivation
2 - 12
Problem: How to deal with data errors and inconsistencies in large data
collections such as databases
e.g. missing information, spelling errors, invalid data, etc..
Especially crucial for data warehouses – combination of several sources
where errors quickly multiply and problem of redundant data arises
Objectives of Data Cleaning:
Detection of Errors
Remove Errors
In a (semi) automatic process
Konstanz, 20.12.2005
Jens Gerken
Data Quality Problems (1/2)
Single Source Problems
Can be schema related as well as instance related
Konstanz, 20.12.2005
Jens Gerken
3 - 12
Data Quality Problems (2/2)
Multi Source Problems
Overlapping or contradicting data
Different representations
Again both schema & instance related problems
Konstanz, 20.12.2005
Jens Gerken
4 - 12
Data Cleaning Process (1/3)
Data Auditing
Analysis of Data to find errors
Workflow Specification
Definition of appropriate transformations
Goal: Eliminate all/most errors and anomalies automatically
Workflow Execution
Post Processing/Controlling
Check results
Resolve remaining problems manually
Konstanz, 20.12.2005
Jens Gerken
5 - 12
Data Cleaning Process (2/3)
6 - 12
Data Auditing
Data profiling – create Metadata by analysing individual attributes on the
instance level (e.g. data type, length, value range, discrete values, variance,
uniqueness)
Data mining – discover data patterns by analysing the whole data collection (e.g.
association rules like „total = quantity*unit price“)
Konstanz, 20.12.2005
Jens Gerken
Data Cleaning Process (3/3)
7 – 12
Workflow Specification
Sequence of operations on the data (schema related data transformations &
cleaning steps)
Choose the appropriate operations to handle data errors/anomalies/etc.
Process should be as automatically as possible
Possibility to include user-written cleaning code
Cause of error important (e.g. keyboard layout can help to correct misspellings)
Early steps: correct single source instance problems
Later on: multi source problems, e.g. duplicates
Workflow has to be tested and verified
Konstanz, 20.12.2005
Jens Gerken
Conflict Resolution Steps (1/2)
Extracting values from free-form attributes
Reordering of values
Value extraction for attribute splitting
Validation and correction
Spell checking & dictionaries
Attribute dependencies (e.g. birthday/age)
Statistical methods (e.g. replace missing values with mean)
Standardization
Schema restructuring (multi source problems)
Conversion of date and times entries
Splitting, merging, folding, unfolding of attributes and tables
Duplicate elimination
Konstanz, 20.12.2005
Jens Gerken
8 - 12
Conflict Resolution Steps (2/2)
9 - 12
Duplicate Elimination
Last step in workflow
First step: identify records concerning the same real world entity (instance
matching).
Second step: merge those records and remove redundancy
Instance Matching
Easy way: There is an identifier attribute (e.g. same primary key)
More difficult way: fuzzy matching which calculates degree of similarity between
records
Problem: very expensive operation – however overhead reducing methods exist
Konstanz, 20.12.2005
Jens Gerken
Potter‘s Wheel (1/3)
Problem of conventional approaches
Time consuming (many iterations), long waiting periods
Users have to write complex transformation scripts
Separate Tools for auditing and transformation
Potter‘s Wheel approach:
Interactive system, instant feedback
Integration of both, data auditing and transformation
Intuitive User Interface – spreadsheet like application
Konstanz, 20.12.2005
Jens Gerken
10 - 12
Potter‘s Wheel (2/3)
11 - 12
Main Features:
Instead of complex transform specifications with regular expressions or custom programs
user specifies by example (e.g. splitting)
Data auditing extensible with user defined domains
Parse „Tayler, Jane, JFK to ORD on April 23, 2000 Coach“ as „[A-Za-z,]* <Airport> to <Airport> on <Date> <Class>“
instead of „[A-Za-z,]* [A-Z]³ to [A-Z]³ on [A-Za-z]* [0-9]*, [0-9]* [A-Za-z]*
Allows easier detection of e.g. logical errors like false airport codes
Problem: tradeoff between overfitting and underfitting structure
Potter‘s Wheel uses Minimun description length method to balance this tradeoff and choose appropriate structure
Data auditing in background on the fly (data streaming also possible)
Reorderer allows sorting on the fly
User only works on a view – real data isn‘t changed until user exports set of transforms e.g.
as C program an runs it on the real data
Undo without problems: just delete unwanted transform from sequence and redo everything else
Konstanz, 20.12.2005
Jens Gerken
Potter‘s Wheel (3/3) + Conclusion
Problems:
Usability of User Interface
How does duplicate elimination work?
Kind of a black box system
General Open Problems of Data Cleaning:
(Automatic) correction of wrong values
Mask wrong values but keep them
Keep several possible values at the same time (2*age. 2*birthday)
Leeds to problems if other values depend on a certain alternative and this turns out to be wrong
Maintenance of cleaned data, especially if sources can‘t be cleaned
Data cleaning framework desireable
Konstanz, 20.12.2005
Jens Gerken
12 - 12