Data Preprocessing
Lecturer: Dr. Bo Yuan
E-mail: [email protected]
Outline
Data Cleansing
Data Transformation
Data Description
Feature Selection
Feature Extraction
Where are data from?
Why Data Preprocessing?
Real data are notoriously dirty!
The biggest challenge in many data mining projects
Incomplete
Occupancy = “ ”
Noisy
Salary = “-100”
Inconsistent
Age = “42” vs. Birthday = “01/09/1985”
Redundant
Too many samples or too many features for the analysis at hand
Others
Data types
Imbalanced datasets
Missing Data
Data are not always available.
One or more attributes of a sample may have empty values.
Many data mining algorithms cannot handle missing values directly.
Missing values may cause significant trouble.
Possible Reasons
Equipment malfunction
Data not provided
Not Applicable (N/A)
Different Types
Missing completely at random
Missing at random
Not missing at random
How to handle missing data?
Ignore
Remove samples/attributes with missing values
The easiest and most straightforward way
Works well when the missing rate is low
Fill in the missing values manually
Recollect the data
Domain knowledge
Tedious/Infeasible
Fill in the missing values automatically
A global constant
The mean or median
Most probable values
More art than science (see the sketch below)
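A minimal MATLAB sketch of automatic imputation (illustrative, not from the slides): NaN marks missing entries, and each is replaced by its column mean.

% X: n-by-d data matrix with NaN marking missing values
X = [1 2; NaN 3; 4 NaN; 5 6];
for j = 1:size(X, 2)
    col = X(:, j);
    col(isnan(col)) = mean(col(~isnan(col)));  % fill with the column mean
    X(:, j) = col;
end
disp(X)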
Missing Data: An Example
[Figure: an X-Y plot in which the samples marked M_Y have missing Y values, illustrating the effect on P(Y|X).]
Noisy Data
[Figures: a smooth signal over x = 0 to 7 (left) and the same signal corrupted by random noise (right).]
Outliers
[Figure: data points around the line y = x + 1, with several outliers lying far from the line.]
Anomaly Detection
Local Outlier Factor
[Figure: k = 3; point o and its neighbours p1, p2, p3 within its k-distance, with p4 lying farther away.]

$distance_k(O)$: the distance between $O$ and its $k$-th nearest neighbour.

Reachability distance:
$$distance_k(A, B) = \max\{distance_k(B),\; d(A, B)\}$$

Local reachability density:
$$lrd(A) = \frac{|N_k(A)|}{\sum_{B \in N_k(A)} distance_k(A, B)}$$

Local outlier factor:
$$LOF_k(A) = \frac{\sum_{B \in N_k(A)} \frac{lrd(B)}{lrd(A)}}{|N_k(A)|} = \frac{\sum_{B \in N_k(A)} lrd(B)}{|N_k(A)| \cdot lrd(A)}$$
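A minimal MATLAB sketch of these definitions (assumptions: Euclidean distance and no ties in the k-distance, so |N_k(A)| = k; pdist requires the Statistics Toolbox):

function lof = simple_lof(X, k)
% X: n-by-d data matrix; lof(i) substantially above 1 suggests an outlier
n = size(X, 1);
D = squareform(pdist(X));          % pairwise Euclidean distances
[sortedD, idx] = sort(D, 2);       % sort each row in ascending order
kdist = sortedD(:, k+1);           % k-distance (column 1 is the point itself)
knn = idx(:, 2:k+1);               % indices of the k nearest neighbours
lrd = zeros(n, 1);
for i = 1:n
    reach = max(kdist(knn(i, :)), D(i, knn(i, :))');  % reachability distances
    lrd(i) = k / sum(reach);       % local reachability density
end
lof = zeros(n, 1);
for i = 1:n
    lof(i) = mean(lrd(knn(i, :))) / lrd(i);
end
end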
Duplicate Data
The merge/purge approach: create keys, sort, merge.
Create keys: derive a comparison key from each record.
Sort: order the records by key so that near-duplicates become neighbours.
Merge: compare neighbouring records and merge those that match.
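A minimal MATLAB sketch of the create-key/sort/merge idea (the names and the 4-character key prefix are illustrative assumptions):

names = {'John Smith'; 'Johnny Smith'; 'Mary Jones'};
keys = lower(regexprep(names, '[^A-Za-z]', ''));   % crude normalised keys
[keys, order] = sort(keys);                        % sorting brings near-duplicates together
for i = 2:numel(keys)
    if strncmp(keys{i-1}, keys{i}, 4)              % compare neighbouring records only
        fprintf('Possible duplicates: %s / %s\n', names{order(i-1)}, names{order(i)});
    end
end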
Data Transformation
Suppose we now have an error-free dataset.
It still needs to be standardized.
Type Conversion
Normalization
Sampling
Attribute Types
Continuous
Real values: Temperature, Height, Weight …
Discrete
Integer values: Number of people …
Ordinal
Rankings: {Average, Good, Best}, {Low, Medium, High} …
Nominal
Symbols: {Teacher, Worker, Salesman}, {Red, Green, Blue} …
String
Text: “Tsinghua University”, “No. 123, Pingan Avenue” …
Type Conversion
[Figures: two scatter plots of the same samples by Size and Color, marked × and ○, grouped in two different ways.]
A nominal attribute with four values can be converted into one-hot binary vectors:
0 0 0 1
0 0 1 0
0 1 0 0
1 0 0 0
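A minimal MATLAB sketch of this conversion (the Color values are illustrative):

colors = {'Red'; 'Green'; 'Blue'; 'Green'};
[labels, ~, idx] = unique(colors);   % idx maps each sample to its category
I = eye(numel(labels));
onehot = I(idx, :)                   % one row of 0/1 codes per sample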
Normalization
The height of someone can be recorded as 1.7 or 170 or 1700, depending on the unit.
Min-max normalization:
$$v' = \frac{v - \min}{\max - \min}(new\_max - new\_min) + new\_min$$
Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$$
Z-score normalization (μ: mean, σ: standard deviation):
$$v' = \frac{v - \mu}{\sigma}$$
Let μ = 54,000 and σ = 16,000. Then
$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
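The two worked examples above in MATLAB:

v = 73600;
v_minmax = (v - 12000) / (98000 - 12000) * (1.0 - 0) + 0   % 0.716
v_zscore = (v - 54000) / 16000                             % 1.225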
Sampling
A database/data warehouse may store terabytes of data.
Processing limits: CPU, Memory, I/O …
Sampling is applied to reduce the time complexity.
In statistics, sampling is often applied because obtaining the entire set of data is too expensive.
Aggregation
Change of scale: cities → states; days → months
Aggregated data are more stable and show less variability.
Sampling can also be used to adjust the class distribution.
Imbalanced datasets
Imbalanced Datasets
[Figures: a dataset with class proportions of 85%, 10% and 5% (LHS) and the predictions of Classifier A and Classifier B (RHS).]
Over-Sampling
SMOTE
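A minimal SMOTE-style sketch in MATLAB (synthetic samples are interpolated between minority samples and their nearest minority-class neighbours; Xmin and k are assumed inputs, and pdist requires the Statistics Toolbox):

function Xnew = smote_once(Xmin, k)
% Xmin: n-by-d minority-class samples; returns one synthetic sample per original
n = size(Xmin, 1);
Xnew = zeros(size(Xmin));
D = squareform(pdist(Xmin));
[~, idx] = sort(D, 2);               % idx(:,1) is the point itself
for i = 1:n
    nn = idx(i, 1 + randi(k));       % a random one of the k nearest neighbours
    gap = rand();                    % interpolation coefficient in [0, 1]
    Xnew(i, :) = Xmin(i, :) + gap * (Xmin(nn, :) - Xmin(i, :));
end
end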
Data Description
Mean
Arithmetic mean
Median
The middle value of the ordered list (or the average of the two middle values)
Mode
The most frequently occurring value in a list
Can be used with non-numerical data.
Variance
Degree of diversity
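A quick MATLAB illustration of these descriptors (the data vector is made up):

x = [2 4 4 4 5 5 7 9];
mean(x)     % arithmetic mean: 5
median(x)   % middle value: 4.5
mode(x)     % most frequent value: 4
var(x)      % variance (n-1 normalisation)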
Data Description
Pearson's product-moment correlation coefficient:
$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$
If $r_{A,B} > 0$, A and B are positively correlated.
If $r_{A,B} = 0$, there is no linear correlation between A and B.
If $r_{A,B} < 0$, A and B are negatively correlated.
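A quick MATLAB check (the vectors are made up); the built-in corrcoef computes the same coefficient:

A = [1 2 3 4 5]';
B = [2 4 5 4 6]';
r_manual = sum((A - mean(A)) .* (B - mean(B))) / ((numel(A) - 1) * std(A) * std(B))
R = corrcoef(A, B);   % R(1,2) equals r_manual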
Data Description
Pearson's chi-square (χ²) test:
$$\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$$

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

(Expected counts are shown in parentheses.)
$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$
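Reproducing the slide's computation in MATLAB (the expected counts come from the row and column margins):

obs = [250 200; 50 1000];
expected = sum(obs, 2) * sum(obs, 1) / sum(obs(:))   % [90 360; 210 840]
chi2 = sum(sum((obs - expected).^2 ./ expected))     % about 507.93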
Data Visualization (1D)
[Figures: a bar chart of counts by Group (1-5) and a pie chart with segments of 33%, 28%, 22%, 11% and 6%.]
Data Visualization (2D)
[Figure: scatter plot of Acceleration (8-26) against Displacement (0-500).]
Surf Plots (3D)
[Figure: a 3D surface plot over roughly x in [-3, 3] and y in [-2, 2].]
Box Plots (High Dimensional)
[Figure: box plots of Values for Dimensions 1-4.]
Parallel Coordinates (High Dimensional)
GUESS
CiteSpace
Gephi
Feature Selection
Candidate attributes: ID, Weight, Age, Gender, Height, Education, Income, Address, Occupation, Address
Questions to ask about each attribute: Noisy? Irrelevant? Duplicate? Complexity?
Class Distributions
[Figure: bar chart of probabilities for Non-Smoker (60%) and Smoker (40%), shown for the total population and broken down by Female and Male.]
Entropy
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)$$
$$H(S) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1.0$$
$$H(S \mid X = a) = -0.8 \log_2 0.8 - 0.2 \log_2 0.2 = 0.7219$$
$$H(S \mid X = b) = -0.05 \log_2 0.05 - 0.95 \log_2 0.95 = 0.2864$$
$$H(S \mid X) = 0.6 \, H(S \mid X = a) + 0.4 \, H(S \mid X = b) = 0.5477$$
$$Gain(S, X) = H(S) - H(S \mid X) = 0.4523$$
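The slide's information-gain computation in MATLAB:

H = @(p) -sum(p .* log2(p));       % entropy of a probability vector
HS  = H([0.5 0.5])                 % 1.0
HSa = H([0.8 0.2])                 % 0.7219
HSb = H([0.05 0.95])               % 0.2864
HSX = 0.6*HSa + 0.4*HSb            % 0.5477
gain = HS - HSX                    % 0.4523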
Feature Subset Search
Exhaustive
All possible combinations:
$$C_{10}^{3} = \frac{10!}{(10-3)!\,3!} = 120 \qquad C_{20}^{5} = \frac{20!}{(20-5)!\,5!} = 15504$$
Branch and Bound
If the criterion is monotonic, then $S_1 \supseteq S_2 \supseteq S_3$ implies $J(S_1) \geq J(S_2) \geq J(S_3)$, so whole branches of the search tree can be pruned.
Feature Subset Search
Top K Individual Features
$$J(X_k) = \{J(x_1), \ldots, J(x_k)\}, \quad J(x_1) \geq J(x_2) \geq \ldots \geq J(x_k)$$
Sequential Forward Selection
$$J(X_k + x_1) \geq J(X_k + x_2) \geq \ldots \geq J(X_k + x_{D-k}), \quad x_i \notin X_k$$
Sequential Backward Selection
$$J(X_k - x_1) \geq J(X_k - x_2) \geq \ldots \geq J(X_k - x_k), \quad x_i \in X_k$$
Optimization Algorithms
Simulated Annealing
Tabu Search
Genetic Algorithms
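A minimal MATLAB sketch of sequential forward selection (J is an assumed handle that scores a feature subset, higher being better; D is the number of features):

function best = sfs(J, D, k)
% greedily grow a subset of k features out of D
best = [];
for step = 1:k
    remaining = setdiff(1:D, best);
    scores = arrayfun(@(f) J([best f]), remaining);
    [~, i] = max(scores);          % add the single best remaining feature
    best = [best remaining(i)];
end
end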
Feature Extraction
Principal Component Analysis
2D Example
Some Math
Goal: find a transformation such that the covariance matrix S(Y) of the transformed data is diagonal, i.e., it has nonzero diagonal entries and all off-diagonal elements are zero (the transformed features are uncorrelated).
A Different View
Approximate each sample by its projection onto a unit vector $e$: $x_k' = \alpha_k e$ with $\alpha_k = e^t x_k$ and $\|e\| = 1$. The squared reconstruction error is
$$J(e) = \sum_{k=1}^{n} \| x_k' - x_k \|^2 = \sum_{k=1}^{n} \| \alpha_k e - x_k \|^2$$
$$= \sum_{k=1}^{n} \alpha_k^2 \|e\|^2 - 2 \sum_{k=1}^{n} \alpha_k e^t x_k + \sum_{k=1}^{n} \|x_k\|^2 = -\sum_{k=1}^{n} \alpha_k^2 + \sum_{k=1}^{n} \|x_k\|^2$$
$$= -\sum_{k=1}^{n} e^t x_k x_k^t e + \sum_{k=1}^{n} \|x_k\|^2 = -e^t S e + \sum_{k=1}^{n} \|x_k\|^2, \qquad S = \sum_{k=1}^{n} x_k x_k^t$$
Minimizing $J(e)$ is therefore equivalent to
$$\max_{e} \; e^t S e \quad s.t. \; \|e\| = 1$$
𝑡
Lagrange Multipliers
$$\max_{x,y} f(x, y) = 3xy \quad s.t. \; 2x + y = 8$$
$$F(x, y, \lambda) = 3xy - \lambda(2x + y - 8)$$
$$F_x = 3y - 2\lambda = 0, \qquad F_y = 3x - \lambda = 0, \qquad F_\lambda = -(2x + y - 8) = 0$$
Solving gives $\lambda = 6$, $x = 2$, $y = 4$, and $f(2, 4) = 3 \times 2 \times 4 = 24$.
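A quick numeric check of the stationary point in MATLAB:

x = 2; y = 4; lambda = 6;
[3*y - 2*lambda, 3*x - lambda, 2*x + y - 8]   % all three partial derivatives vanish
f = 3*x*y                                     % 24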
More Math ...
$$u = e^t S e - \lambda (e^t e - 1)$$
$$\frac{\partial u}{\partial e} = 2Se - 2\lambda e = 0 \;\Rightarrow\; Se = \lambda e \quad \text{(} e \text{: eigenvector, } \lambda \text{: eigenvalue)}$$
$$e^t S e = \lambda e^t e = \lambda$$
PCA: project the original data onto the eigenvectors of S with the largest eigenvalues.
$$\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 3 \\ 3 \end{pmatrix} = 3 \begin{pmatrix} 1 \\ 1 \end{pmatrix} \qquad \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} 1 \\ -1 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$$
PCA Examples
s = [1 0.8; 0.8 1];
x = mvnrnd(zeros(10000, 2), s);
plot(x(:,1), x(:,2), '.');
axis equal;
grid on;
[V, D] = eig(s);
% V = [0.7071 0.7071; -0.7071 0.7071]  (columns are eigenvectors, up to sign)
% D = [0.2 0; 0 1.8]
newx = x * V(:,2);   % project onto the leading eigenvector
hist(newx, 50);
Fish
The Issue of PCA
Now, let’s consider class information ...
Linear Discriminant Analysis
The objective of LDA is to perform dimension reduction while preserving
as much of the class discriminatory information as possible.
Given a set of d-dimensional vectors x1, x2, ..., xn, of which N1 belong to class ω1 and N2 belong to class ω2. Among all possible projection lines y = wTx, find the one that maximizes the separability of the classes.
Measure of Separability
Fisher Criterion
$$J = \frac{|\mu_1 - \mu_2|^2}{S_1^2 + S_2^2}$$
where $\mu_i$ and $S_i^2$ are the mean and scatter of class $i$ after projection.
Some Math …
Between-class scatter: $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$
Within-class scatter: $S_W = S_1 + S_2$
More Math...
$$J(w) = \frac{w^T S_B w}{w^T S_W w} \quad \text{(generalized Rayleigh quotient)}$$
More Math...
$$S_W^{-1} S_B w = J w \quad \text{(an eigenvector problem!)}$$
$$S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = (\mu_1 - \mu_2) R, \qquad R = (\mu_1 - \mu_2)^T w \; \text{(a scalar)}$$
$$J w = S_W^{-1} S_B w = S_W^{-1} (\mu_1 - \mu_2) R \;\Rightarrow\; w = \frac{R}{J} \, S_W^{-1} (\mu_1 - \mu_2)$$
LDA Example
Dataset
$$C_1 = \{(1,2), (2,3), (3,3), (4,5), (5,5)\}$$
$$C_2 = \{(1,0), (2,1), (3,1), (3,2), (5,3), (6,5)\}$$
Covariance of [C1; C2]:
$$Z = \begin{pmatrix} 2.7636 & 2.2545 \\ 2.2545 & 3.0182 \end{pmatrix}$$
Eigenvectors and eigenvalues of Z:
$$V = \begin{pmatrix} -0.7268 & 0.6869 \\ 0.6869 & 0.7268 \end{pmatrix}, \qquad D = \begin{pmatrix} 0.6328 & 0 \\ 0 & 5.1490 \end{pmatrix}$$
The direction of the PCA projection: $(0.6869, 0.7268)^T$
LDA Example
The mean of each class:
$$\mu_1 = mean(C_1) = (3.0, 3.6)^T, \qquad \mu_2 = mean(C_2) = (3.3, 2.0)^T$$
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \begin{pmatrix} 0.11 & -0.53 \\ -0.53 & 2.56 \end{pmatrix}$$
The scatter of each class:
$$S_1 = 4 \times cov(C_1) = \begin{pmatrix} 10 & 8 \\ 8 & 7.2 \end{pmatrix}, \qquad S_2 = 5 \times cov(C_2) = \begin{pmatrix} 17.3 & 16 \\ 16 & 16 \end{pmatrix}$$
$$S_W = S_1 + S_2 = \begin{pmatrix} 27.3 & 24 \\ 24 & 23.2 \end{pmatrix}$$
LDA Example
$$S_W^{-1} S_B = \begin{pmatrix} 0.26 & -1.27 \\ -0.30 & 1.42 \end{pmatrix}$$
Eigenvectors and eigenvalues of $S_W^{-1} S_B$:
$$V = \begin{pmatrix} -0.98 & 0.67 \\ -0.20 & -0.75 \end{pmatrix}, \qquad D = \begin{pmatrix} 0 & 0 \\ 0 & 1.69 \end{pmatrix}$$
The direction of the LDA projection: $(0.6656, -0.7463)^T$
Alternatively:
$$S_W^{-1}(\mu_1 - \mu_2) = (-0.7936, 0.8899)^T$$
After normalization: $(-0.6656, 0.7463)^T$, the same direction up to sign.
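The whole example can be reproduced in MATLAB in a few lines:

C1 = [1 2; 2 3; 3 3; 4 5; 5 5];
C2 = [1 0; 2 1; 3 1; 3 2; 5 3; 6 5];
Sw = 4*cov(C1) + 5*cov(C2);          % S1 + S2
w = Sw \ (mean(C1)' - mean(C2)');    % inv(Sw) * (mu1 - mu2)
w = w / norm(w)                      % approximately [-0.6656; 0.7463]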
LDA Example
[Figure: the two classes plotted in 2D with the PCA projection direction and the LDA projection direction overlaid.]
C-Class LDA
For C-class LDA with C = 2, $S_B$ is defined as:
$$S_B = N_1 (\mu_1 - \mu)(\mu_1 - \mu)^T + N_2 (\mu_2 - \mu)(\mu_2 - \mu)^T$$
where $\mu = (N_1 \mu_1 + N_2 \mu_2)/N$ is the overall mean, so that
$$\mu_1 - \mu = \frac{N_2}{N}(\mu_1 - \mu_2), \qquad \mu_2 - \mu = \frac{N_1}{N}(\mu_2 - \mu_1)$$
$$S_B = \frac{N_1 N_2^2}{N^2} (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T + \frac{N_1^2 N_2}{N^2} (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \frac{N_1 N_2}{N} (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$$
This is the two-class $S_B$ up to the constant factor $N_1 N_2 / N$.
C-Class LDA
N1 = 1000; N2 = 500; N3 = 2000;
X1 = mvnrnd([0, 0], [1, 0; 0, 1], N1);
X2 = mvnrnd([3, 6], [1, -0.8; -0.8, 1], N2);
X3 = mvnrnd([10, 6], [1, 0.8; 0.8, 1], N3);
S1 = (N1-1)*cov(X1); S2 = (N2-1)*cov(X2); S3 = (N3-1)*cov(X3);
Sw = S1 + S2 + S3;
M1 = mean(X1); M2 = mean(X2); M3 = mean(X3);
Mu = (N1*M1 + N2*M2 + N3*M3) / (N1 + N2 + N3);
Sb = N1*(M1-Mu)'*(M1-Mu) + N2*(M2-Mu)'*(M2-Mu) + N3*(M3-Mu)'*(M3-Mu);
J = inv(Sw)*Sb;
[V, D] = eig(J);
[Figures: scatter plot of the three classes and histograms of the data projected onto the two leading eigenvectors of inv(Sw)*Sb.]
Limitations of LDA
LDA produces at most C-1 projections, since SB is a matrix of rank C-1 or less.
SW may be singular.
LDA does not work well when ...
Reading Materials
M. A. Hernandez and S. J. Stolfo, “Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem,” Data Mining and Knowledge Discovery, vol. 2, pp. 9-37, 1998.
A. Donders, G. van der Heijden, T. Stijnen, and K. Moons, “Review: A Gentle Introduction to Imputation of Missing Values,” Journal of Clinical Epidemiology, vol. 59, pp. 1087-1091, 2006.
N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
N. Japkowicz and S. Stephen, “The Class Imbalance Problem: A Systematic Study,” Intelligent Data Analysis, vol. 6, pp. 429-449, 2002.
D. Keim, “Information Visualization and Visual Data Mining,” IEEE Transactions on Visualization and Computer Graphics, vol. 7(1), pp. 100-107, 2002.
PCA Tutorials
http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Lagrange Multipliers
http://diglib.stanford.edu:8091/~klein/lagrange-multipliers.pdf
Review
How to handle missing values?
How to detect outliers?
What is the major issue with imbalanced datasets?
Why is feature selection important?
How to search for a good subset of features?
What does PCA stand for?
What is the major issue of PCA?
What does LDA stand for?
What is the major issue of LDA?
Next Week’s Class Talk
Volunteers are required for next week’s class talk.
Topic: Data Visualization Tools
Length: 20 minutes plus question time
Suggested Points of Interest
Software
• CiteSpace
• GUESS
• Gephi
Functionality
Demos
Features
Case Studies