Data preparation: Selection,
Preprocessing, and Transformation
Literature:
I.H. Witten and E. Frank, Data Mining,
chapter 2 and chapter 7
1
Fayyad’s KDD Methodology
(Process diagram) data → Selection → target data → Preprocessing & cleaning → processed data → Transformation & feature selection → transformed data → Data Mining → patterns → Interpretation / Evaluation → knowledge
2
Contents
Data Selection
Data Preprocessing
Data Transformation
3
Data Selection
Goal
Understanding the data
Explore the data:
possible attributes
their values
distribution, outliers
4
Getting to know the data
Simple visualization tools are very useful for
identifying problems
Nominal attributes: histograms (Distribution
consistent with background knowledge?)
Numeric attributes: graphs (Any obvious outliers?)
2-D and 3-D visualizations show dependencies
Domain experts need to be consulted
Too much data to inspect? Take a sample!
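A minimal sketch of this kind of exploration with pandas and matplotlib; the file name and attribute names are only illustrative.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("weather.csv")                      # hypothetical file name

# Nominal attribute: bar chart of value counts (is the distribution plausible?)
df["outlook"].value_counts().plot(kind="bar", title="outlook")
plt.show()

# Numeric attribute: box plot and summary statistics to spot obvious outliers
df["temperature"].plot(kind="box", title="temperature")
plt.show()
print(df["temperature"].describe())

# Too much data to inspect? Take a random sample.
sample = df.sample(n=min(1000, len(df)), random_state=0)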
5
Data preprocessing
Problem: different data sources (e.g. sales
department, customer billing department, …)
Differences: styles of record keeping, conventions,
time periods, data aggregation, primary keys, errors
Data must be assembled, integrated, cleaned up
“Data warehouse”: consistent point of access
External data may be required (“overlay data”)
Critical: type and level of data aggregation
6
Data Preprocessing
Choose data structure (table, tree or set of
tables)
Choose attributes with enough information
Decide on a first representation of the
attributes (numeric or nominal)
Decide how to handle missing values
Decide how to handle inaccurate data (cleansing)
7
Attribute types used in practice
Most schemes accommodate just two levels of
measurement: nominal and ordinal
Nominal attributes are also called “categorical”,
“enumerated”, or “discrete”
But: “enumerated” and “discrete” imply order
Special case: dichotomy (“boolean” attribute)
Ordinal attributes are called “numeric”, or
“continuous”
But: “continuous” implies mathematical continuity
8
The ARFF format
% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
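Outside WEKA, a file like this can be read with SciPy's ARFF loader; the file name weather.arff is an assumption.
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("weather.arff")   # structured array plus attribute metadata
df = pd.DataFrame(data)

# Nominal values come back as bytes; decode them to ordinary strings.
for col in df.select_dtypes(include=[object]).columns:
    df[col] = df[col].str.decode("utf-8")

print(meta)
print(df.head())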
9
Attribute types
ARFF supports numeric and nominal attributes
Interpretation depends on learning scheme
Numeric attributes are interpreted as
ordinal scales if less-than and greater-than are used
ratio scales if distance calculations are performed
(normalization/standardization may be required)
Instance-based schemes define distance
between nominal values (0 if values are equal,
1 otherwise)
Integers: nominal, ordinal, or ratio scale?
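A sketch of this distance convention for mixed attributes, with numeric values normalized by their range so they stay comparable to the 0/1 nominal distances; all names are illustrative.
import numpy as np

def mixed_distance(x, y, numeric_ranges):
    """x, y: dicts attribute -> value; numeric_ranges: attribute -> (min, max)."""
    total = 0.0
    for a in x:
        if a in numeric_ranges:                          # numeric: normalized difference
            lo, hi = numeric_ranges[a]
            total += (abs(x[a] - y[a]) / (hi - lo)) ** 2
        else:                                            # nominal: 0 if equal, 1 otherwise
            total += 0.0 if x[a] == y[a] else 1.0
    return np.sqrt(total)

d = mixed_distance({"outlook": "sunny", "temperature": 85},
                   {"outlook": "rainy", "temperature": 65},
                   numeric_ranges={"temperature": (64, 85)})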
10
Nominal vs. ordinal
Attribute “age” nominal
If age = young and astigmatic = no and
tear production rate = normal
then recommendation = soft
If age = pre-presbyopic and
astigmatic = no and
tear production rate = normal
then recommendation = soft
Attribute “age” ordinal
(e.g. “young” < “pre-presbyopic” < “presbyopic”)
If age ≤ pre-presbyopic and
astigmatic = no and
tear production rate = normal
then recommendation = soft
11
Missing values
Frequently indicated by out-of-range entries
Types: unknown, unrecorded, irrelevant
Reasons: malfunctioning equipment, changes in
experimental design, collation of different datasets,
measurement not possible
Missing value may have significance in itself
(e.g. missing test in a medical examination)
Most schemes assume that is not the case
“missing” may need to be coded as additional value
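One way to code this explicitly, sketched with pandas; the column names and the median fill are only illustrative choices.
import numpy as np
import pandas as pd

df = pd.DataFrame({"outlook": ["sunny", None, "rainy"],
                   "temperature": [85, np.nan, 65]})

df["outlook"] = df["outlook"].fillna("missing")           # "missing" as an extra nominal value
df["temperature_missing"] = df["temperature"].isna()      # indicator that the value was absent
df["temperature"] = df["temperature"].fillna(df["temperature"].median())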
12
Inaccurate values
Reason: data has typically not been collected for the purpose of mining
Result: errors and omissions that don’t affect original
purpose of data (e.g. age of customer)
Typographical errors in nominal attributes ⇒ values need to be checked for consistency
Typographical and measurement errors in numeric attributes ⇒ outliers need to be identified
Errors may be deliberate (e.g. wrong zip codes)
Other problems: duplicates, stale data
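A few routine pandas checks along these lines; the file and column names are hypothetical and the z-score threshold is an arbitrary choice.
import pandas as pd

df = pd.read_csv("customers.csv")                         # hypothetical file

print(df["city"].value_counts())                          # spot typos / inconsistent spellings
z = (df["age"] - df["age"].mean()) / df["age"].std()      # crude outlier check on a numeric attribute
print(df[z.abs() > 3])
df = df.drop_duplicates()                                 # remove exact duplicate records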
13
Transformation
Attribute selection
Adding a random (i.e. irrelevant) attribute can
significantly degrade C4.5’s performance
Problem: further down the tree, attribute selection is based on smaller and smaller amounts of data
IBL is also very susceptible to irrelevant
attributes
Number of training instances required increases
exponentially with number of irrelevant attributes
Naïve Bayes doesn’t have this problem.
Relevant attributes can also be harmful
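One way to see the effect for yourself is to append purely random attributes and compare cross-validated accuracy. The sketch uses scikit-learn's DecisionTreeClassifier, a CART-style learner rather than C4.5, so it only illustrates the idea.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.random((X.shape[0], 20))])    # add 20 irrelevant attributes

tree = DecisionTreeClassifier(random_state=0)
print(cross_val_score(tree, X, y, cv=10).mean())          # original attributes
print(cross_val_score(tree, X_noisy, y, cv=10).mean())    # with random attributes added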
14
Scheme-independent selection
Filter approach: assessment based on general
characteristics of the data
One method: find subset of attributes that is enough
to separate all the instances
Another method: use different learning scheme (e.g.
C4.5, 1R) to select attributes
IBL-based attribute weighting techniques can also be
used (but can’t find redundant attributes)
CFS: uses correlation-based evaluation of subsets
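A simplified sketch of a CFS-style merit function: it rewards subsets whose attributes correlate with the class but not with each other. Real CFS uses symmetric uncertainty on discretized attributes; plain Pearson correlation is used here only to show the shape of the heuristic.
import numpy as np

def cfs_merit(X, y, subset):
    """X: (n, d) numeric array, y: numeric class codes, subset: non-empty list of columns."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                    for i in subset for j in subset if i < j])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)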
15
Attribute subsets for weather data
16
Searching the attribute space
Number of possible attribute subsets is
exponential in the number of attributes
Common greedy approaches: forward
selection and backward elimination
More sophisticated strategies:
Bidirectional search
Best-first search: can find the optimum solution
Beam search: approximation to best-first search
Genetic algorithms
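A skeleton of greedy forward selection, with the subset evaluation left as a pluggable merit function (a filter score such as the one above, or a wrapper evaluation as on the next slide).
def forward_selection(n_attributes, merit):
    """Greedily add the attribute that improves merit(subset) most; stop when nothing helps."""
    selected, best = [], float("-inf")
    while True:
        candidates = [(merit(selected + [a]), a)
                      for a in range(n_attributes) if a not in selected]
        if not candidates:
            break
        score, attr = max(candidates)
        if score <= best:
            break
        selected.append(attr)
        best = score
    return selected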
17
Scheme-specific selection
Wrapper approach: attribute selection implemented as
wrapper around learning scheme
Evaluation criterion: cross-validation performance
Time consuming: adds a factor of k² even for greedy approaches with k attributes
Linearity in k requires prior ranking of attributes
Scheme-specific attribute selection essential for
learning decision tables
Can be done efficiently for DTs and Naïve Bayes
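In scikit-learn (not WEKA) the same wrapper idea is available as SequentialFeatureSelector, which scores candidate attribute subsets by the cross-validated performance of the chosen scheme; the dataset and parameter values below are only illustrative.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
selector = SequentialFeatureSelector(GaussianNB(), n_features_to_select=2,
                                     direction="forward", cv=5)
selector.fit(X, y)
print(selector.get_support())        # boolean mask of the selected attributes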
18
Discretizing numeric attributes
Can be used to avoid making normality
assumption in Naïve Bayes and Clustering
Simple discretization scheme is used in 1R
C4.5 performs local discretization
Global discretization can be advantageous
because it’s based on more data
The learner can be applied to the discretized attribute, or to binary attributes coding the cut points of the discretized attribute (see the sketch below)
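A small sketch of the binary cut-point coding; the temperature values and cut points are only illustrative.
import numpy as np

temperature = np.array([64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85])
cut_points = [71.5, 80.5]                                          # assumed cut points
binary = np.column_stack([temperature > c for c in cut_points])    # one indicator per cut point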
19
Unsupervised discretization
Unsupervised discretization generates intervals
without looking at class labels
The only option when clustering (no class labels are available)
Two main strategies:
Equal-interval binning
Equal-frequency binning (also called histogram
equalization)
Inferior to supervised schemes in classification
tasks
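Both strategies in a few lines of pandas; the choice of four bins is arbitrary.
import pandas as pd

temperature = pd.Series([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])

equal_width = pd.cut(temperature, bins=4)      # equal-interval binning
equal_freq = pd.qcut(temperature, q=4)         # equal-frequency binning
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())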
20
Entropy-based discretization
Supervised method that builds a decision tree
with pre-pruning on the attribute being
discretized
Entropy used as splitting criterion
MDLP used as stopping criterion
State-of-the-art discretization method
Application of MDLP:
“Theory” is the splitting point (log2[N-1] bits) plus
class distribution in each subset
Description length (DL) before/after adding the splitting point is compared
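A sketch of the core step only: choose the cut point that minimizes the weighted class entropy of the two resulting subsets (the recursion and the MDL stopping rule of the following slides are omitted).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cut_point(values, labels):
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    best_cut, best_score = None, np.inf
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue                            # only cut between distinct values
        cut = (v[i] + v[i - 1]) / 2
        score = (i * entropy(y[:i]) + (len(v) - i) * entropy(y[i:])) / len(v)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score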
21
Example: temperature attribute
22
Formula for MDLP
N instances and
k classes and entropy E in original set
k1 classes and entropy E1 in first subset
k2 classes and entropy E2 in second subset
Splitting is accepted if:
gain > log2(N − 1)/N + [log2(3^k − 2) − k·E + k1·E1 + k2·E2] / N
Doesn’t result in any discretization intervals
for the temperature attribute
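The stopping rule written out as a function, using the symbols defined above; this is a direct transcription of the inequality, not WEKA's implementation.
import numpy as np

def mdl_accepts(gain, N, k, k1, k2, E, E1, E2):
    """Accept the split only if the information gain exceeds the MDL cost of describing it."""
    threshold = (np.log2(N - 1) / N
                 + (np.log2(3 ** k - 2) - k * E + k1 * E1 + k2 * E2) / N)
    return gain > threshold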
23
Other discretization methods
Top-down procedure can be replaced by bottom-up method
MDLP can be replaced by chi-squared test
Dynamic programming can be used to find
optimum k-way split for given additive
criterion
Requires time quadratic in number of instances if
entropy is used as criterion
Can be done in linear time if error rate is used as
evaluation criterion
24
Transformation
WEKA provides a lot of filters that can help you transform and select your attributes!
Use them to build a promising model for the caravan data!
25