Tools for Data Preparation

November 8, 2002
Why Data Preparation?
Data Mining Stages    Time to Complete (% of Total)    Importance to Success (% of Total)
Data Preparation      75                               75
Data Surveying        20                               15
Data Modeling         5                                10
Source: D. Pyle, Data Preparation for Data Mining, 1999
Data Preparation Process
1) Data Selection
2) Data Cleaning
3) New Data Construction
4) Data Formatting
Data Selection
Based on the following criteria:
• Data quality properties: completeness and correctness.
• Technical constraints imposed by the data mining tools, such as limits on data volume or data type.
Data Cleaning
Possible techniques for data cleaning:
• Data normalization, e.g., decimal scaling into the range (0,1) by mapping, or standard deviation normalization.
• Data smoothing, e.g., discretization of numeric attributes, which is helpful or even necessary for logic-based methods.
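The two techniques above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample values are invented, "standard deviation normalization" is read as the usual z-score, and discretization is shown as simple equal-width binning, which is one of several possible schemes.

```python
import statistics

def decimal_scale(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    power = 0
    while largest / (10 ** power) >= 1:
        power += 1
    return [v / (10 ** power) for v in values]

def std_normalize(values):
    """Z-score: subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumes the values are not all identical
    return [(v - mean) / std for v in values]

def equal_width_bins(values, n_bins):
    """Discretize numeric values into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [23, 35, 47, 59, 71]          # invented sample attribute
scaled = decimal_scale(ages)         # values divided by 100
zscores = std_normalize(ages)        # mean 0, standard deviation 1
bins = equal_width_bins(ages, 3)     # bin index 0, 1, or 2 per value
```

Decimal scaling keeps the relative spacing of the values, while the z-score also centers them; which one suits a given tool depends on the input range that tool expects.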
Data Cleaning Cont’d
• Treatment of missing values: predict the missing values and replace them with the least biased estimates, e.g., values that preserve the relationships between variables.
• Data reduction: the most usual step is to examine the attributes and consider their predictive potential, e.g., attribute selection from means and variances, or merging features using a linear transform.
Data Missing Example
Position   Original Sample   Position 11 Missing   Preserve Mean   Preserve Variance
1          0.0886            0.0886                0.0886          0.0886
2          0.0684            0.0684                0.0684          0.0684
3          0.3515            0.3515                0.3515          0.3515
4          0.9874            0.9874                0.9874          0.9874
5          0.4713            0.4713                0.4713          0.4713
6          0.6115            0.6115                0.6115          0.6115
7          0.2573            0.2573                0.2573          0.2573
8          0.2914            0.2914                0.2914          0.2914
9          0.1662            0.1662                0.1662          0.1662
10         0.4400            0.4400                0.4400          0.4400
11         0.6939            ?                     0.3731          0.6629
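The replacement values for position 11 can be approximately reproduced from the ten observed values. The sketch below assumes that "preserve mean" means filling with the mean of the observed values, and that "preserve variance" means choosing a value that leaves the sample variance (n - 1 denominator) unchanged. Under those assumptions the results agree with the tabulated 0.3731 and 0.6629 to about three decimal places; the small differences presumably come from rounding in the source.

```python
import math
import statistics

# The ten observed values (positions 1-10 in the example table).
observed = [0.0886, 0.0684, 0.3515, 0.9874, 0.4713,
            0.6115, 0.2573, 0.2914, 0.1662, 0.4400]

n = len(observed)
mean = statistics.mean(observed)

# Filling with the mean of the observed values leaves the mean unchanged.
mean_fill = mean

# Sum of squared deviations of the observed values.
ss = sum((v - mean) ** 2 for v in observed)

# Adding a value at distance d from the mean raises the sum of squares by
# n / (n + 1) * d**2. Solving (ss + n/(n+1) * d**2) / n == ss / (n - 1)
# for d gives the offset that keeps the sample variance unchanged.
offset = math.sqrt(ss * (n + 1) / (n * (n - 1)))
variance_fill = mean + offset  # mean - offset preserves the variance too

print(round(mean_fill, 4), round(variance_fill, 4))
```

Note that neither fill recovers the original value 0.6939. Each one protects a single summary statistic of the sample.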
New Data Construction
Constructive operations on the selected data include:
• Derivation of new attributes from existing attributes.
• Generation of new records.
• Data transformation.
• Merging of tables.
• Aggregation: summarizing information from multiple records and/or tables.
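Three of these operations (derivation, merging, and aggregation) can be illustrated with plain Python dictionaries. The tables, field names, and values below are hypothetical and exist only for this sketch.

```python
from collections import defaultdict

# Hypothetical input tables, invented for this illustration.
customers = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
orders = [
    {"customer_id": 1, "quantity": 2, "unit_price": 5.0},
    {"customer_id": 1, "quantity": 1, "unit_price": 20.0},
    {"customer_id": 2, "quantity": 3, "unit_price": 4.0},
]

# Derivation: add a new attribute computed from existing ones.
for order in orders:
    order["amount"] = order["quantity"] * order["unit_price"]

# Merging: attach the customer name to each order (a simple key join).
merged = [
    {**order, "name": customers[order["customer_id"]]["name"]}
    for order in orders
]

# Aggregation: summarize several order records into one total per customer.
totals = defaultdict(float)
for row in merged:
    totals[row["name"]] += row["amount"]

summary = dict(totals)
```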
Data Formatting
Data formatting involves syntactic modifications required by the modeling tools:
• Reordering of the attributes or records.
• Changes related to the constraints of the modeling tools, e.g., removing commas or tabs, trimming strings to the maximum allowed number of characters, and replacing special characters with characters from the allowed set.
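A minimal sketch of such formatting steps follows. The length limit, the allowed character set, and the sample record are assumptions made for illustration; a real modeling tool would dictate its own constraints.

```python
import re

MAX_LEN = 12                                   # assumed maximum field length
DISALLOWED = re.compile(r"[^A-Za-z0-9 _.-]")   # assumed allowed character set

def format_field(value, max_len=MAX_LEN):
    """Remove commas and tabs, replace disallowed characters, then trim."""
    value = value.replace(",", " ").replace("\t", " ")
    value = DISALLOWED.sub("_", value)   # replace special characters
    value = " ".join(value.split())      # collapse runs of whitespace
    return value[:max_len]

def reorder(record, column_order):
    """Return the record's values in the order the tool expects."""
    return [record[name] for name in column_order]

record = {"city": "São Paulo,\tBrazil", "id": "42"}   # invented sample record
row = [format_field(v) for v in reorder(record, ["id", "city"])]
```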
Data Preparation Tools
• Data Junction Integration Studio: http://www.datajunction.com/
• SPSS Base 11.5: http://www.spss.com/
• Informatica PowerCenter: http://www.informatica.com/
• WizWhy: http://www.wizsoft.com/
Data Junction Integration Studio
It includes five visual design tools:
1) Process Designer
• Full conditional flow control
• Testing of global variables
• Execution of external processes and a complete expression language, allowing automation of complex event-driven or scheduled routines
• Multi-threaded Integration Engine
Data Junction Integration Studio Cont’d
2) Map Designer
• Mapping source data to target structures
• Defining rules for mapping complex hierarchical structures
• Defining complex rules for record filtering
• Error and reject record handling
• Error logging
Data Junction Integration Studio Cont’d
3) Metadata Query
• Allows users to run queries against the Data Junction Metadata Repository
4) Record Layout Designer
• A visual tool for defining or modifying data structures (including field names, sizes, lengths, offsets, data types, etc.) for both sources and targets
Data Junction Integration Studio Cont’d
5) Universal Data Browser
• Allows users to view files other than the sources and targets involved in the current design session
• Displays data formats from applications not installed on the system
SPSS Base 11.5 Data Preparation Components
1) Data Editor: a spreadsheet-like system for defining, entering, editing, and displaying data.
2) Data preparation tools: get data ready for analysis. Use the Define Variable Properties tool to set up data dictionary information (such as value labels, variable labels, and variable types) as a "template" that can be applied to other data files and to other variables within the same file. Apply the dictionary information using the Copy Data Properties tool.
SPSS Base 11.5 Cont’d
3) Data Restructure Wizard: takes a data file that has multiple records per subject and restructures it so that the data for each subject are in a single record, with no need to set up vectors or loops. This is particularly helpful with transactional data. It can also do the reverse, that is, take data from a single record and spread it across multiple cases.
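The restructuring idea itself (not the SPSS wizard or its syntax) can be sketched in plain Python. The field names `subject`, `visit`, and `score` and the sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical transactional data: several records per subject.
long_rows = [
    {"subject": "A", "visit": 1, "score": 10},
    {"subject": "A", "visit": 2, "score": 12},
    {"subject": "B", "visit": 1, "score": 7},
]

def to_wide(rows):
    """Collapse multiple records per subject into one record per subject."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["subject"]][f"score_{row['visit']}"] = row["score"]
    return [{"subject": subject, **cols} for subject, cols in wide.items()]

def to_long(rows):
    """Reverse action: spread one record per subject across multiple cases."""
    out = []
    for row in rows:
        for key, value in row.items():
            if key.startswith("score_"):
                visit = int(key.split("_")[1])
                out.append({"subject": row["subject"], "visit": visit, "score": value})
    return out

wide_rows = to_wide(long_rows)
round_trip = to_long(wide_rows)
```

In this sketch a subject with fewer visits simply has fewer columns; a real tool would typically pad the missing columns with a missing-value marker.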
SPSS Base 11.5 Cont’d
4) Data transformations: work with combined data more reliably by "flipping" responses so that all the data point in the same direction. For example, this helps to create multiple-item indices when working with surveys that ask respondents to answer both positively worded and negatively worded questions. Other transformation capabilities include conditional transformations, computing new variables, and recoding values.
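"Flipping" a response is commonly done by reverse-coding it on its scale. The sketch below assumes a 5-point scale and uses invented item names; it shows the arithmetic only, not SPSS syntax.

```python
SCALE_MIN, SCALE_MAX = 1, 5          # assumed 5-point response scale

def flip(response):
    """Reverse-code a response so that 1 becomes 5, 2 becomes 4, and so on."""
    return SCALE_MIN + SCALE_MAX - response

# Hypothetical survey answers; "dread_mondays" is negatively worded.
respondent = {"enjoy_work": 4, "dread_mondays": 2, "feel_valued": 5}
NEGATIVE_ITEMS = {"dread_mondays"}

# Recode only the negatively worded items so all items point the same way.
aligned = {
    item: flip(value) if item in NEGATIVE_ITEMS else value
    for item, value in respondent.items()
}

# Compute a new variable: a multiple-item index as the mean of the aligned items.
index = sum(aligned.values()) / len(aligned)
```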
WizWhy
Features:
• Performs Boolean as well as multi-value analysis
• Analyzes the data by discovering all the if-then rules
• Reveals necessary and sufficient conditions (if-and-only-if rules)
• Calculates the error probability of each rule
• Reveals interesting phenomena in the data by uncovering unexpected rules
WizWhy Features Cont’d
• Predicts new cases on the basis of the discovered rules
• Explains predictions by listing the relevant rules
• Calculates each prediction’s conclusive probability and error probability
• Bases predictions on error costs (the cost of a miss vs. a false alarm) rather than on subjective choices
• Points out cases deviating from the discovered rules
• Proven to be faster and more accurate than other data mining methods
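The idea of basing a prediction on error costs can be illustrated with the standard expected-cost decision rule. This is a generic sketch, not WizWhy's actual algorithm, which the slides do not describe; the function name, probabilities, and costs are assumptions.

```python
def predict_positive(p_positive, cost_miss, cost_false_alarm):
    """Predict positive when a miss would cost more, in expectation, than a false alarm."""
    expected_cost_if_negative = p_positive * cost_miss               # risk of a miss
    expected_cost_if_positive = (1 - p_positive) * cost_false_alarm  # risk of a false alarm
    return expected_cost_if_negative > expected_cost_if_positive

# With equal costs, a 30% chance is not enough to predict positive.
equal_costs = predict_positive(0.3, cost_miss=1, cost_false_alarm=1)

# When a miss is nine times as costly, the same 30% chance is enough.
costly_miss = predict_positive(0.3, cost_miss=9, cost_false_alarm=1)
```

Under this rule the decision threshold on the probability is cost_false_alarm / (cost_miss + cost_false_alarm), so it follows from the stated costs and not from an arbitrarily chosen cutoff.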
WizWhy Rules Report Example
1) CUSTOMER starts with MORGA
if and only if
KEY is 985
The rule exists in 32 records.
Significance level: error probability is almost 0
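An if-and-only-if rule like the one above holds when the condition and the conclusion are true for exactly the same records. The sketch below checks such a rule against a handful of invented records; it counts supporting records and exceptions, and does not attempt to reproduce WizWhy's significance calculation.

```python
# Hypothetical records, invented for this illustration.
records = [
    {"CUSTOMER": "MORGAN LTD", "KEY": 985},
    {"CUSTOMER": "MORGAN LTD", "KEY": 985},
    {"CUSTOMER": "SMITH CO", "KEY": 101},
    {"CUSTOMER": "MORGA INC", "KEY": 985},
]

def check_iff_rule(rows, condition, conclusion):
    """Count records supporting the rule and records contradicting it."""
    support = sum(1 for r in rows if condition(r) and conclusion(r))
    # An exception is any record where exactly one side of the rule is true.
    exceptions = sum(1 for r in rows if condition(r) != conclusion(r))
    return support, exceptions

support, exceptions = check_iff_rule(
    records,
    condition=lambda r: r["KEY"] == 985,
    conclusion=lambda r: r["CUSTOMER"].startswith("MORGA"),
)
```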