Data Preprocessing
Compiled By:
Umair Yaqub
Lecturer
Govt. Murray College Sialkot
Major Tasks in Data Preprocessing
• Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  - Integration of multiple databases, data cubes, or files
• Data transformation
  - Normalization and aggregation
• Data reduction
  - Obtains a reduced representation that is much smaller in volume yet produces the same or similar analytical results
Data Cleaning
• Importance
  - “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
  - “Data cleaning is the number one problem in data warehousing”—DCI survey
• Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration
Data Cleaning – How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification)
• Fill in the missing value manually: tedious and often infeasible
• Use a global constant to fill in the missing value, e.g., “unknown” (though this may be mistaken for a meaningful new class)
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value: inference-based, such as regression, a Bayesian formula, or a decision tree
Data Cleaning – How to Handle Missing Data? (contd…)
• Use the most common value of the attribute to fill in the missing value
• Use the most common value of the attribute for all samples belonging to the same class to fill in the missing value: smarter (see the sketch below)
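As a concrete illustration of the mean- and mode-based strategies above, here is a minimal pandas sketch. The DataFrame, the numeric attribute income, and the class attribute label are made-up names used only for illustration; they do not come from the slides.

```python
import numpy as np
import pandas as pd

# Illustrative data: "income" has missing values, "label" is the class attribute
df = pd.DataFrame({
    "label":  ["A", "A", "B", "B", "B"],
    "income": [50.0, np.nan, 30.0, np.nan, 40.0],
})

# 1) Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 2) Fill with the mean of samples belonging to the same class ("smarter")
df["income_class_mean"] = df["income"].fillna(
    df.groupby("label")["income"].transform("mean")
)

# 3) Fill with the most common value (mode) of the attribute; ties take the smallest
df["income_mode"] = df["income"].fillna(df["income"].mode()[0])

print(df)
```

The class-conditional fill uses groupby(...).transform so the replacement values stay aligned with the original row index.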
Data Cleaning – How to Handle Noisy Data?
• Binning method (see the sketch after this list)
  - smooth a sorted value by consulting its neighborhood (local smoothing)
  - first sort the data and partition it into bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
• Regression
  - smooth by fitting the data to regression functions
• Clustering
  - detect and remove outliers
• Combined computer and human inspection
  - detect suspicious values and have a human check them
• Many methods for data smoothing are also methods for data reduction
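A minimal NumPy sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the price values are illustrative, not taken from the slides.

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

# Sort the data and partition it into (equal-frequency) bins
bins = np.array_split(np.sort(prices), 3)

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(by_means)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]

# Smoothing by bin boundaries: each value moves to the closer of its bin's min/max
by_bounds = np.concatenate([
    np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins
])
print(by_bounds)  # [ 4.  4. 15. 21. 21. 24. 25. 25. 34.]
```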
Data Cleaning Tool Categories
• Data scrubbing tools
  - use simple domain knowledge (e.g., knowledge of postal addresses and spell-checking) to detect errors and make corrections in the data
  - these tools rely on parsing and fuzzy matching techniques when cleaning data from multiple sources
• Data auditing tools
  - find discrepancies by analyzing the data to discover rules and relationships, and by detecting data that violate such conditions
  - they may employ statistical analysis to find correlations, or clustering to identify outliers
• Data migration tools
  - allow simple transformations to be specified, such as replacing the string “gender” with “sex” (see the sketch below)
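Reading the slide’s example as an attribute rename, here is a minimal pandas sketch of the kind of simple transformation a data migration tool lets you specify; the column names and values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F"], "age": [34, 29]})

# Simple migration-style transformation: replace the attribute name "gender" with "sex"
df = df.rename(columns={"gender": "sex"})
print(list(df.columns))   # ['sex', 'age']
```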
Data Cleaning Tools
• Potter’s Wheel, for example, is a publicly available data cleaning tool (see http://control.cs.berkeley.edu/abc) that integrates discrepancy detection and transformation.
Data Integration
• Combines data from multiple sources into a coherent store
• Issues:
  - Schema integration
    - Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
    - Metadata can be used to avoid errors in schema integration
  - Redundancy
    - The same attribute may have different names in different databases
    - An attribute may be redundant if it can be derived from another table
    - Redundancies may be detected by correlation analysis
Schema Integration…
• Examples of metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, as well as null rules for handling blank, zero, or null values.
• The metadata may also be used to help transform the data (e.g., where data codes for pay type may be “H” and “S” in one database and 1 and 2 in another; see the sketch below).
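A minimal pandas sketch of a metadata-driven code transformation for the pay-type example above. Only the codes “H”/“S” and 1/2 come from the slide; the target labels "hourly" and "salaried" and the table layout are assumptions for illustration.

```python
import pandas as pd

# Source A encodes pay type as "H"/"S"; source B uses 1/2.
# The metadata says these codes are equivalent, so map both onto one representation.
pay_type_map = {"H": "hourly", "S": "salaried", 1: "hourly", 2: "salaried"}

a = pd.DataFrame({"emp_id": [1, 2], "pay_type": ["H", "S"]})
b = pd.DataFrame({"emp_id": [3, 4], "pay_type": [1, 2]})

combined = pd.concat([a, b], ignore_index=True)
combined["pay_type"] = combined["pay_type"].map(pay_type_map)
print(combined)
```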
Redundancies
• Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
• For numerical attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient.
Correlation Analysis (Numerical Data)
• Correlation coefficient (also called Pearson’s product moment coefficient):

  r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-product.
• If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do); the higher the value, the stronger the correlation.
• rA,B = 0: A and B are uncorrelated (no linear relationship); rA,B < 0: A and B are negatively correlated.
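A minimal NumPy sketch that computes the coefficient exactly as defined above and cross-checks it against np.corrcoef; the attribute values are illustrative.

```python
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 7.0, 11.0])
n = len(A)

# Sample correlation coefficient, following the formula above
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
print(r)                        # ≈ 0.99: strongly positively correlated

print(np.corrcoef(A, B)[0, 1])  # same value via NumPy's built-in
```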
Correlation Analysis (Categorical Data)
• χ² (chi-square) test:

  \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
  - The number of hospitals and the number of car thefts in a city are correlated
  - Both are causally linked to a third variable: population
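A minimal NumPy sketch of the χ² computation for a 2×2 contingency table of two categorical attributes; the observed counts are illustrative.

```python
import numpy as np

# Observed counts for two categorical attributes (illustrative 2x2 table)
observed = np.array([[250.0,  200.0],
                     [ 50.0, 1000.0]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected counts if the two attributes were independent
expected = row_totals @ col_totals / grand_total

chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)   # ≈ 507.9: a large value, so the attributes are likely related
```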
Data Integration (contd…)
• Redundancy (contd…)
  - Duplication should also be detected at the tuple level
  - A common reason is denormalized tables
  - Inconsistencies may arise between duplicates
• Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources may differ
  - Possible reasons: different representations or different scales, e.g., metric vs. British units (see the sketch below)
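A minimal pandas sketch combining the two points above: resolving a scale conflict (pounds vs. kilograms) before merging, then detecting tuple-level duplicates. The table and column names are illustrative, not from the slides.

```python
import pandas as pd

# Two sources describing the same customers; source B stores weight in pounds
a = pd.DataFrame({"cust_id": [1, 2], "weight_kg": [70.0, 82.5]})
b = pd.DataFrame({"cust_id": [2, 3], "weight_lb": [181.9, 154.3]})

# Resolve the data value conflict: convert British units (lb) to metric (kg)
b["weight_kg"] = b["weight_lb"] / 2.20462
b = b.drop(columns="weight_lb")

merged = pd.concat([a, b], ignore_index=True)

# Detect duplication at the tuple level: the same customer appears in both sources
dupes = merged[merged.duplicated(subset="cust_id", keep=False)]
print(dupes)

# Keep one record per customer (here: simply the first occurrence)
print(merged.drop_duplicates(subset="cust_id", keep="first"))
```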