Data Preprocessing
Compiled By:
Umair Yaqub
Lecturer
Govt. Murray College Sialkot
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation that is much smaller in volume yet produces the same or similar analytical results
Data Cleaning
Importance
“Data cleaning is one of the three biggest problems in data warehousing”—Ralph
Kimball
“Data cleaning is the number one problem in data warehousing”—DCI survey
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Data Cleaning – How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification)
Fill in the missing value manually: tedious, and often infeasible for large data sets
Use a global constant to fill in the missing value: e.g., “unknown”; the mining program may then mistake this constant for an interesting class
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill in
the missing value: smarter
Use the most probable value to fill in the missing value: inference-based methods such as regression, the Bayesian formula, or decision tree induction
Data Cleaning – How to Handle Missing Data?
Use the most common value of an attribute to fill in the missing value
Use the most common value of an attribute for all samples belonging to
the same class to fill in the missing value: smarter
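As a rough sketch of these fill-in strategies, the Python below (using pandas, with a hypothetical DataFrame and column names) applies the attribute mean, the class-conditional mean, and the most common value:

```python
import pandas as pd

# Hypothetical data: 'income' is numeric with gaps, 'segment' is the class label
df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],
    "income":  [50.0, None, 30.0, None, 40.0],
    "city":    ["Lahore", "Lahore", None, "Sialkot", "Sialkot"],
})

# Attribute mean over all tuples
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Class-conditional mean: mean of the samples in the same class (smarter)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("segment")["income"].transform("mean")
)

# Most common value (mode) for a categorical attribute
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```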
Data Cleaning – How to Handle Noisy Data?
Binning method (see the sketch after this list)
smooth a sorted value by consulting its neighborhood (local smoothing)
first sort the data and partition it into bins
then smooth by bin means, bin medians, or bin boundaries
Regression
smooth by fitting the data to regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values automatically, then have a human verify them
Many methods for data smoothing are also methods
for data reduction
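A minimal sketch of the binning approach, assuming nine already-sorted hypothetical prices partitioned into three equal-depth bins, smoothed by bin means and by bin boundaries:

```python
import numpy as np

# Hypothetical sorted attribute values, partitioned into equal-frequency bins
values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = values.reshape(3, 3)          # three bins of depth 3 (data already sorted)

# Smoothing by bin means: every value becomes its bin's mean
by_means = np.repeat(bins.mean(axis=1), 3).reshape(3, 3)

# Smoothing by bin boundaries: every value snaps to the nearer bin edge
by_bounds = np.where(
    np.abs(bins - bins[:, [0]]) <= np.abs(bins - bins[:, [-1]]),
    bins[:, [0]],
    bins[:, [-1]],
)

print(by_means)   # e.g., bin [4, 8, 15] -> [9, 9, 9]
print(by_bounds)  # e.g., bin [4, 8, 15] -> [4, 4, 15]
```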
Data Cleaning Tool Categories
Data scrubbing tools
use simple domain knowledge (e.g., knowledge of postal addresses and spell-checking) to detect errors and make corrections in the data.
These tools rely on parsing and fuzzy matching techniques when cleaning
data from multiple sources.
Data auditing tools
find discrepancies by analyzing the data to discover rules and relationships, and by detecting data that violate such conditions.
they may employ statistical analysis to find correlations, or clustering to
identify outliers.
Data migration tools
allow simple transformations to be specified, such as replacing the string “gender” with “sex”.
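In a library such as pandas, this kind of declared transformation is a one-liner; the column names below are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F"], "age": [30, 25]})

# Simple declared transformation: replace the attribute name "gender" with "sex"
df = df.rename(columns={"gender": "sex"})
print(df.columns.tolist())  # ['sex', 'age']
```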
Data Cleaning Tools
Potter’s Wheel, for example, is a publicly available data cleaning tool (see http://control.cs.berkeley.edu/abc) that integrates discrepancy detection and transformation.
Data Integration
Combines data from multiple sources into a coherent store
Issues:
Schema integration
Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Metadata can be used to avoid errors in schema integration
Redundancy
The same attribute may have different names in different databases
An attribute may be redundant if it can be derived from another table
Redundancies may be detected by correlation analysis
Schema Integration…
Examples of metadata for each attribute include the name, meaning,
data type, and range of values permitted for the attribute, and null rules
for handling blank, zero, or null values.
The metadata may also be used to help transform the data (e.g., where
data codes for pay type in one database may be “H” and “S”, and 1 and
2 in another).
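A sketch of applying such metadata during integration, mapping source B's numeric pay-type codes onto source A's letter codes (table and column names are hypothetical):

```python
import pandas as pd

# Source A codes pay type as "H"/"S"; source B uses 1/2 for the same concept
source_a = pd.DataFrame({"emp_id": [1, 2], "pay_type": ["H", "S"]})
source_b = pd.DataFrame({"emp_id": [3, 4], "pay_type": [1, 2]})

# Metadata tells us 1 corresponds to "H" (hourly) and 2 to "S" (salaried)
source_b["pay_type"] = source_b["pay_type"].map({1: "H", 2: "S"})

# With the codes reconciled, the sources can be combined into one coherent store
combined = pd.concat([source_a, source_b], ignore_index=True)
print(combined)
```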
Redundancies
Some redundancies can be detected by correlation analysis. Given two
attributes, such analysis can measure how strongly one attribute implies
the other, based on the available data.
For numerical attributes, we can evaluate the correlation between two
attributes, A and B, by computing the correlation coefficient
Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson’s product-moment coefficient):

$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(AB)$ is the sum of the AB cross-product.
If $r_{A,B} > 0$, A and B are positively correlated (A’s values increase as B’s do); the higher the value, the stronger the correlation.
$r_{A,B} = 0$: A and B are uncorrelated (no linear relationship); $r_{A,B} < 0$: A and B are negatively correlated
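Both forms of the formula can be verified numerically; the sketch below uses hypothetical values for A and B and checks the result against NumPy's own Pearson correlation:

```python
import numpy as np

# Hypothetical paired observations of attributes A and B
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 12.0])
n = len(A)

# Sample means and sample standard deviations (n - 1 in the denominator)
A_bar, B_bar = A.mean(), B.mean()
sigma_A, sigma_B = A.std(ddof=1), B.std(ddof=1)

# First form: sum of deviation cross-products
r1 = ((A - A_bar) * (B - B_bar)).sum() / ((n - 1) * sigma_A * sigma_B)

# Second form: sum of raw cross-products minus n * A_bar * B_bar
r2 = (np.sum(A * B) - n * A_bar * B_bar) / ((n - 1) * sigma_A * sigma_B)

print(r1, r2)                    # identical up to floating point
print(np.corrcoef(A, B)[0, 1])   # NumPy's Pearson r agrees
```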
Correlation Analysis (Categorical Data)
$\chi^2$ (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
The larger the $\chi^2$ value, the more likely it is that the variables are related (see the sketch below)
The cells that contribute the most to the $\chi^2$ value are those whose actual count is very different from the expected count
Correlation does not imply causality
# of hospitals and # of car thefts in a city are correlated
Both are causally linked to a third variable: population
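A sketch computing $\chi^2$ from a hypothetical 2×2 contingency table, with each expected count derived from the row and column totals:

```python
import numpy as np

# Hypothetical contingency table: rows = attribute 1, columns = attribute 2
observed = np.array([[250.0, 200.0],
                     [ 50.0, 1000.0]])

# Expected count for each cell: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)  # a large value suggests the two attributes are related
```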
Data Integration (contd…)
Redundancy (contd…)
Duplication should also be detected at the tuple level
Reasons: denormalized tables
Inconsistencies may arise between duplicates
Detecting and resolving data value conflicts
for the same real-world entity, attribute values from different sources may differ
possible reasons: different representations, different scales, e.g., metric vs.
British units
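A brief sketch of tuple-level duplicate detection and one simple value-conflict resolution (converting pounds to kilograms); the data and names are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "name":   ["Ali", "Ali", "Sara"],
    "city":   ["Sialkot", "Sialkot", "Lahore"],
    "weight": [154.0, 154.0, 60.0],          # first two rows in pounds, last in kg
    "unit":   ["lb", "lb", "kg"],
})

# Tuple-level duplication: flag and drop fully repeated rows
print(df.duplicated().sum())                  # 1 duplicate tuple
df = df.drop_duplicates()

# Data value conflict: normalize all weights to metric (1 lb = 0.45359237 kg)
is_lb = df["unit"] == "lb"
df.loc[is_lb, "weight"] = df.loc[is_lb, "weight"] * 0.45359237
df.loc[is_lb, "unit"] = "kg"
print(df)
```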