Transcript ppt

Data
Cleaning
An Overview of data quality problems and data cleaning solution approaches
Seminarvortrag: Digital Information Curation
Jens Gerken – 20.12.2005
ZuiScat
Konstanz, 20.12.2005
Jens Gerken
Outline
1 - 12

Motivation

Data Quality Problems

Data Cleaning Process

Typical Conflict Resolution Steps

Data Cleaning Tool: Potter’s Wheel

Conclusion
Konstanz, 20.12.2005
Jens Gerken
Motivation

2 - 12
Problem: How to deal with data errors and inconsistencies in large data
collections such as databases
 e.g. missing information, spelling errors, invalid data, etc..

Especially crucial for data warehouses – combination of several sources
where errors quickly multiply and problem of redundant data arises

Objectives of Data Cleaning:
 Detection of Errors
 Remove Errors
 In a (semi) automatic process
Konstanz, 20.12.2005
Jens Gerken
Data Quality Problems (1/2)

Single Source Problems
 Can be schema related as well as instance related
Konstanz, 20.12.2005
Jens Gerken
3 - 12
Data Quality Problems (2/2)

Multi Source Problems
 Overlapping or contradicting data
 Different representations
 Again both schema & instance related problems
Konstanz, 20.12.2005
Jens Gerken
4 - 12
Data Cleaning Process (1/3)

Data Auditing
 Analysis of Data to find errors

Workflow Specification
 Definition of appropriate transformations
 Goal: Eliminate all/most errors and anomalies automatically

Workflow Execution

Post Processing/Controlling
 Check results
 Resolve remaining problems manually
Konstanz, 20.12.2005
Jens Gerken
5 - 12
Data Cleaning Process (2/3)

6 - 12
Data Auditing
 Data profiling – create Metadata by analysing individual attributes on the
instance level (e.g. data type, length, value range, discrete values, variance,
uniqueness)
 Data mining – discover data patterns by analysing the whole data collection (e.g.
association rules like „total = quantity*unit price“)
Konstanz, 20.12.2005
Jens Gerken
Data Cleaning Process (3/3)

7 – 12
Workflow Specification
 Sequence of operations on the data (schema related data transformations &
cleaning steps)
 Choose the appropriate operations to handle data errors/anomalies/etc.
 Process should be as automatically as possible
 Possibility to include user-written cleaning code
 Cause of error important (e.g. keyboard layout can help to correct misspellings)
 Early steps: correct single source instance problems
 Later on: multi source problems, e.g. duplicates
 Workflow has to be tested and verified
Konstanz, 20.12.2005
Jens Gerken
Conflict Resolution Steps (1/2)



Extracting values from free-form attributes

Reordering of values

Value extraction for attribute splitting
Validation and correction

Spell checking & dictionaries

Attribute dependencies (e.g. birthday/age)

Statistical methods (e.g. replace missing values with mean)
Standardization


Schema restructuring (multi source problems)


Conversion of date and times entries
Splitting, merging, folding, unfolding of attributes and tables
Duplicate elimination
Konstanz, 20.12.2005
Jens Gerken
8 - 12
Conflict Resolution Steps (2/2)

9 - 12
Duplicate Elimination
 Last step in workflow
 First step: identify records concerning the same real world entity (instance
matching).
 Second step: merge those records and remove redundancy

Instance Matching
 Easy way: There is an identifier attribute (e.g. same primary key)
 More difficult way: fuzzy matching which calculates degree of similarity between
records
 Problem: very expensive operation – however overhead reducing methods exist
Konstanz, 20.12.2005
Jens Gerken
Potter‘s Wheel (1/3)

Problem of conventional approaches
 Time consuming (many iterations), long waiting periods
 Users have to write complex transformation scripts
 Separate Tools for auditing and transformation

Potter‘s Wheel approach:
 Interactive system, instant feedback
 Integration of both, data auditing and transformation
 Intuitive User Interface – spreadsheet like application
Konstanz, 20.12.2005
Jens Gerken
10 - 12
Potter‘s Wheel (2/3)

11 - 12
Main Features:

Instead of complex transform specifications with regular expressions or custom programs 
user specifies by example (e.g. splitting)

Data auditing extensible with user defined domains

Parse „Tayler, Jane, JFK to ORD on April 23, 2000 Coach“ as „[A-Za-z,]* <Airport> to <Airport> on <Date> <Class>“
instead of „[A-Za-z,]* [A-Z]³ to [A-Z]³ on [A-Za-z]* [0-9]*, [0-9]* [A-Za-z]*

Allows easier detection of e.g. logical errors like false airport codes

Problem: tradeoff between overfitting and underfitting structure

Potter‘s Wheel uses Minimun description length method to balance this tradeoff and choose appropriate structure

Data auditing in background on the fly (data streaming also possible)

Reorderer allows sorting on the fly

User only works on a view – real data isn‘t changed until user exports set of transforms e.g.
as C program an runs it on the real data

Undo without problems: just delete unwanted transform from sequence and redo everything else
Konstanz, 20.12.2005
Jens Gerken
Potter‘s Wheel (3/3) + Conclusion

Problems:
 Usability of User Interface
 How does duplicate elimination work?
 Kind of a black box system

General Open Problems of Data Cleaning:
 (Automatic) correction of wrong values

Mask wrong values but keep them

Keep several possible values at the same time (2*age. 2*birthday)

Leeds to problems if other values depend on a certain alternative and this turns out to be wrong
 Maintenance of cleaned data, especially if sources can‘t be cleaned
 Data cleaning framework desireable
Konstanz, 20.12.2005
Jens Gerken
12 - 12