Transcript Sampling

Data Transformation and Feature Selection/Extraction
Qiang Yang
Thanks: J. Han, Isabelle Guyon, Martin Bachler
Monday, April 13, 2015
Data Mining: Concepts and Techniques
Continuous Attribute: Temperature

Outlook    Temperature  Humidity  Windy  Class
sunny      40           high      false  N
sunny      37           high      true   N
overcast   34           high      false  P
rain       26           high      false  P
rain       15           normal    false  P
rain       13           normal    true   N
overcast   17           normal    true   P
sunny      28           high      false  N
sunny      25           normal    false  P
rain       23           normal    false  P
sunny      27           normal    true   P
overcast   22           high      true   P
overcast   40           normal    false  P
rain       31           high      true   N
Discretization

- Three types of attributes:
  - Nominal: values from an unordered set
    - Example: attribute "outlook" from the weather data, with values "sunny", "overcast", and "rainy"
  - Ordinal: values from an ordered set
    - Example: attribute "temperature" in the weather data, with values "hot" > "mild" > "cool"
  - Continuous: real numbers
- Discretization:
  - Divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Discretization also reduces the data size
  - Supervised (entropy-based) vs. unsupervised (binning)
Simple Discretization Methods: Binning

- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size (a uniform grid)
  - If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B - A) / N
  - The most straightforward approach
  - But outliers may dominate the presentation: skewed data is not handled well
- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
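As a concrete illustration (not part of the original slides), here is a minimal Python sketch of both partitioning schemes; the temperature values come from the weather table above, and the choice of N = 3 bins is arbitrary.

```python
# Minimal sketch of equal-width and equal-depth binning.
# Temperature values are taken from the weather table; N = 3 is arbitrary.

def equal_width_boundaries(values, n):
    """Return the n+1 boundaries of n intervals of equal width W = (B - A) / n."""
    a, b = min(values), max(values)
    w = (b - a) / n
    return [a + i * w for i in range(n + 1)]

def equal_depth_bins(values, n):
    """Split the sorted values into n bins with approximately the same number of samples."""
    ordered = sorted(values)
    size = len(ordered) / n
    return [ordered[int(i * size):int((i + 1) * size)] for i in range(n)]

temperatures = [40, 37, 34, 26, 15, 13, 17, 28, 25, 23, 27, 22, 40, 31]
print(equal_width_boundaries(temperatures, 3))  # [13.0, 22.0, 31.0, 40.0]
print(equal_depth_bins(temperatures, 3))        # three bins of 4-5 values each
```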
Histograms

- A popular data reduction technique
- Divide data into buckets and store the average (or sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems

[Figure: histogram with counts (0 to 40) on the y-axis and value buckets from 10,000 to 100,000 on the x-axis]
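As a rough illustration, the sketch below builds the simple equal-width variant of such a histogram and stores only the interval, count, and average per bucket; the sample values are invented, and the optimal dynamic-programming construction mentioned above is not shown.

```python
# Minimal sketch of an equal-width histogram used as a data reduction:
# only (interval, count, average) is kept per bucket, not the raw values.
# The sample values below are invented for illustration.

def histogram_summary(values, n_buckets):
    a, b = min(values), max(values)
    width = (b - a) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        # Clamp the maximum value into the last bucket.
        i = min(int((v - a) / width), n_buckets - 1)
        buckets[i].append(v)
    return [((a + i * width, a + (i + 1) * width),   # bucket interval
             len(bs),                                # count
             sum(bs) / len(bs) if bs else None)      # average stored for the bucket
            for i, bs in enumerate(buckets)]

values = [12000, 18000, 25000, 31000, 47000, 52000, 68000, 74000, 90000, 99000]
for interval, count, average in histogram_summary(values, 5):
    print(interval, count, average)
```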
Supervised Method: Entropy-Based Discretization

- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

    E(S, T) = (|S1| / |S|) * Ent(S1) + (|S2| / |S|) * Ent(S2)

- The boundary T that minimizes the entropy function over all possible boundaries is selected as a binary discretization
- Greedy method:
  - Candidate boundaries T are scanned from the smallest to the largest value of attribute A, and the process is applied recursively to the resulting intervals until some stopping criterion is met, e.g., for some user-given threshold δ:

    Ent(S) - E(T, S) < δ
How to Calculate ent(S)?

- Given two classes, Yes and No, in a set S:
  - Let p1 be the proportion of Yes
  - Let p2 be the proportion of No
  - p1 + p2 = 100%
- The entropy is:

    ent(S) = -p1*log(p1) - p2*log(p2)

- When p1 = 1 and p2 = 0, ent(S) = 0 (the set is pure)
- When p1 = 50% and p2 = 50%, ent(S) is at its maximum
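Putting the last two slides together, the sketch below computes ent(S) and E(S, T) for the temperature column of the weather table and scans candidate boundaries for the one with minimum entropy; base-2 logarithms and midpoints between distinct values as candidates are conventional choices, not spelled out on the slides.

```python
import math

def ent(labels):
    """ent(S) = -p1*log(p1) - p2*log(p2), using base-2 logarithms."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def split_entropy(pairs, t):
    """E(S, T) = |S1|/|S| * ent(S1) + |S2|/|S| * ent(S2) for boundary T."""
    s1 = [c for v, c in pairs if v <= t]
    s2 = [c for v, c in pairs if v > t]
    n = len(pairs)
    return len(s1) / n * ent(s1) + len(s2) / n * ent(s2)

# (temperature, class) pairs from the weather table above
data = [(40, 'N'), (37, 'N'), (34, 'P'), (26, 'P'), (15, 'P'), (13, 'N'), (17, 'P'),
        (28, 'N'), (25, 'P'), (23, 'P'), (27, 'P'), (22, 'P'), (40, 'P'), (31, 'N')]

distinct = sorted({v for v, _ in data})
candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]  # midpoints as candidate boundaries
best = min(candidates, key=lambda t: split_entropy(data, t))
print("ent(S) =", ent([c for _, c in data]))
print("best boundary T =", best, "with E(S, T) =", split_entropy(data, best))
```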
Transformation: Normalization

- Min-max normalization:

    v' = (v - minA) / (maxA - minA) * (new_maxA - new_minA) + new_minA

- Z-score normalization:

    v' = (v - meanA) / stdevA

- Normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
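A minimal sketch of all three formulas; the sample values (the temperature column again) and the target range [0, 1] for min-max normalization are illustrative choices.

```python
import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """v' = (v - minA)/(maxA - minA) * (new_maxA - new_minA) + new_minA"""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, stdev_a):
    """v' = (v - meanA) / stdevA"""
    return (v - mean_a) / stdev_a

def decimal_scaling(values):
    """v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

values = [13, 15, 17, 22, 23, 25, 26, 27, 28, 31, 34, 37, 40, 40]
mean = sum(values) / len(values)
stdev = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
print([round(min_max(v, min(values), max(values)), 2) for v in values])
print([round(z_score(v, mean, stdev), 2) for v in values])
print(decimal_scaling(values))
```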
Transforming Ordinal to Boolean

- A simple transformation codes an ordinal attribute with n values using n-1 boolean attributes
- Example: attribute "temperature"

    Original data    Transformed data
    Temperature      Temperature > cold    Temperature > medium
    Cold             False                 False
    Medium           True                  False
    Hot              True                  True

- How many binary attributes shall we introduce for nominal values such as "Red" vs. "Blue" vs. "Green"? (See the sketch below.)
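A small sketch of this coding under the assumed ordering cold < medium < hot; the function name and the dictionary output format are made up for illustration. For a nominal attribute such as Red/Blue/Green there is no order to exploit, so a common answer to the question above is one indicator attribute per value (one-hot coding), although n-1 dummy attributes can also suffice.

```python
# Ordinal-to-boolean coding: n ordered values -> n-1 "greater than" attributes.
# The ordering cold < medium < hot is assumed; names are illustrative.

ORDER = ['cold', 'medium', 'hot']

def ordinal_to_boolean(value, order=ORDER):
    rank = order.index(value)
    # One boolean per threshold: value > cold, value > medium, ...
    return {'> ' + threshold: rank > i for i, threshold in enumerate(order[:-1])}

for v in ORDER:
    print(v, ordinal_to_boolean(v))
# cold   {'> cold': False, '> medium': False}
# medium {'> cold': True, '> medium': False}
# hot    {'> cold': True, '> medium': True}
```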
Data Sampling
Sampling

- Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data
  - Simple random sampling may perform very poorly in the presence of skewed (uneven) classes
- Develop adaptive sampling methods
  - Stratified sampling:
    - Approximate the percentage of each class (or subpopulation of interest) in the overall database
    - Used in conjunction with skewed data (see the sketch below)
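A minimal sketch of stratified sampling, assuming each record carries a class label and that the same fraction is drawn from every stratum so class percentages are approximately preserved; the record layout and field names are invented for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, get_class, fraction, seed=0):
    """Draw roughly `fraction` of the records from each class (stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[get_class(r)].append(r)
    sample = []
    for cls, rows in strata.items():
        k = max(1, round(fraction * len(rows)))  # keep at least one record per class
        sample.extend(rng.sample(rows, k))
    return sample

# Skewed data: 95 records of class 'P', 5 records of class 'N'
records = ([{'id': i, 'class': 'P'} for i in range(95)] +
           [{'id': i, 'class': 'N'} for i in range(95, 100)])
sample = stratified_sample(records, lambda r: r['class'], fraction=0.2)
print(sum(r['class'] == 'P' for r in sample), 'P records,',
      sum(r['class'] == 'N' for r in sample), 'N record(s) in the sample')
```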
Sampling

[Figure: a random sample drawn from the raw data]
Sampling Example

[Figure: raw data vs. a cluster/stratified sample]
Summary

- Data preparation is a big issue for data mining
- Data preparation includes transformations such as:
  - Data sampling and feature selection
  - Discretization
  - Missing value handling
  - Incorrect value handling
- Feature selection and feature extraction