#### Transcript Chi-squared Test and Principle Component Analysis

```Data Mining:
Concepts and Techniques
— Chapter 3 —
Cont.
More on Feature Selection:


Chi-squared test
Principal Component Analysis
1
Attribute Selection

Question: Are attributes
A1 and A2 independent?
 If they are very
dependent, we can
remove either
A1 or A2
 If A1 is independent
on a class attribute A2,
we can
remove A1 from our
training data
ID
Outlook
Temperature
Humidity
Windy
Play
1
100
40
90
0
T
2
100
40
90
1
F
3
50
40
90
0
T
4
10
30
90
0
T
5
10
15
70
0
T
6
10
15
70
1
F
7
50
15
70
1
T
8
100
30
90
0
F
9
100
15
70
0
T
10
10
30
70
0
F
11
100
30
70
1
F
12
50
30
90
1
T
13
50
40
70
0
T
2
Deciding to remove attributes in feature
selection
Dependent
(ChiSq=small)
A2=class attribute
A1
A2 = class attribute
Independent
(Chisq=large
?
?
Dependent
(ChiSq=small)
?
Independent
(Chisq=large
?
3
Chi-Squared Test (cont.)

Question: Are attributes A1 and A2 independent?

These features are nominal valued (discrete)

Null Hypothesis: we expect independence
Outlook
Temperature
Sunny
High
Cloudy
Low
Sunny
High
4
The Weather example: Observed Count
temperature

High
Low
Outlook
Subtotal
Outlook
Sunny
Cloudy
Temperat
ure
Subtotal:
2
0
2
0
1
1
2
1
Outlook
Temperat
ure
Sunny
High
Cloudy
Low
Sunny
High
Total
count in
table =3
5
The Weather example: Expected Count
If attributes were independent, then the subtotals would be
Like this (this table is also known as
temperature

High
Low
Subtotal
Outlook
Outlook
Temperat
ure
Sunny
High
Sunny
2*2/3=4/3
=1.3
2*1/3=2/3
=0.6
2
Cloudy
Low
Cloudy
2*1/3=0.6
1*1/3=0.3
1
Sunny
High
Subtotal:
2
1
Total
count in
table =3
6
Question: How different between
observed and expected?
•If Chi-squared value is very large, then A1 and A2 are not
independent  that is, they are dependent!
•Degrees of freedom: if table has n*m items, then freedom = (n1)*(m-1)
•In our example
•Degree = 1
•Chi-Squared=?
7
Chi-Squared Table: what does it mean?

If calculated value is much greater than in the table, then you have reason
to reject the independence assumption

When your calculated chi-square value is greater than the chi2 value shown in the 0.05
column (3.84) of this table  you are 95% certain that attributes are
actually dependent!

i.e. there is only a 5% probability that your calculated X2 value would occur by chance
8
Example Revisited (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)
We don’t have to have two-dimensional count table (also known as
contingency table)

Suppose that the ratio of male to female students in the Science
Faculty is exactly 1:1,

But, the Honours class over the past ten years there have been 80
females and 40 males.

Question: Is this a significant departure from the (1:1) expectation?
Observed
Male
Female
Total
40
80
120
Honours
9
Expected (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)




Suppose that the ratio of
male to female students in
the Science Faculty is
exactly 1:1,
but in the Honours class over
the past ten years there have
been 80 females and 40
males.
Question: Is this a
significant departure from
the (1:1) expectation?
Note: the expected is filled
in, from 1:1 expectation,
Expected
Male
Female
Total
60
60
120
Honours
10
Chi-Squared Calculation
Female
Male
Total
Observed numbers
(O)
80
40
120
Expected numbers
(E)
60
60
120
O-E
20
-20
0
(O-E)2
400
400
(O-E)2 / E
6.67
6.67 Sum=13.34 = X2
11
Chi-Squared Test (Cont.)

Then, check the chi-squared table for significance


http://helios.bto.ed.ac.uk/bto/statistics/table2.html#Chi%20squared%20test
Compare our X2 value with a c2 (chi squared) value
in a table of c2 with n-1 degrees of freedom



n is the number of categories, i.e. 2 in our case -- males
and females).
We have only one degree of freedom (n-1). From the c2
table, we find a "critical value of 3.84 for p = 0.05.
13.34 > 3.84, and the expectation (that the
Male:Female in honours major are 1:1) is
wrong!
12
Chi-Squared Test in Weka:
weather.nominal.arff
@relation weather.symbolic
@attribute
@attribute
@attribute
@attribute
@attribute
outlook {sunny, overcast, rainy}
temperature {hot, mild, cool}
humidity {high, normal}
windy {TRUE, FALSE}
play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
13
Chi-Squared Test in Weka
14
Chi-Squared Test in Weka
15
Example of Decision Tree Induction
Initial attribute set:
{A1, A2, A3, A4, A5, A6}
A4 ?
A6?
A1?
Class 1
>
Class 2
Class 1
Class 2
Reduced attribute set: {A1, A4, A6}
16
Principal Component Analysis

Given N data vectors from k-dimensions, find c <= k orthogonal vectors
that can be best used to represent data


The original data set is reduced to one consisting of N data vectors on
c principal components (reduced dimensions)
Each data vector Xj is a linear combination of the c principal component
vectors Y1, Y2, … Yc




Xj= m+W1*Y1+W2*Y2+…+Wk*Yc, i=1, 2, … N
M is the mean of the data set
W1, W2, … are the ith components
Y1, Y2, … are the ith Eigen vectors

Works for numeric data only

Used when the number of dimensions is large
17
Principal Component Analysis
See online tutorials such as
http://www.cs.otago.ac.nz/cosc453/student_
X2
tutorials/principal_components.pdf

Y1
Y2
x
Note: Y1 is
the first
eigen vector,
Y2 is the
second. Y2
ignorable.
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
xx
x
X1
Key observation:
variance = largest!
18
Principal Component Analysis: one
Temperature
attribute first
42
40


Question: how much
along the axis? (distance
to the mean)
Variance=Standard
n
deviation^2
s 
2
(Xi  X )
i 1
(n  1)
24
30
15
18
15
30
15
2
30
35
30
40
30
19
Now consider two dimensions
X=Temperature
Covariance: measures the
correlation between X and Y
• cov(X,Y)=0: independent
•Cov(X,Y)>0: move same dir
•Cov(X,Y)<0: move oppo dir
n
cov(X , Y ) 
(X
i
 X )(Yi  Y )
i 1
(n  1)
Y=Humidity
40
90
40
90
40
90
30
90
15
70
15
70
15
70
30
90
15
70
30
70
30
70
30
90
40
70
30
90
20
More than two attributes: covariance
matrix

Contains covariance values between all
possible dimensions (=attributes):
C

nxn
 (cij | cij  cov( Dimi , Dim j ))
Example for three attributes (x,y,z):
 cov( x, x) cov( x, y ) cov( x, z ) 


C   cov( y, x) cov( y, y ) cov( y, z ) 
 cov( z, x) cov( z, y ) cov( z, z ) 


21
Background: eigenvalues AND
eigenvectors



Eigenvectors e : C e = e
How to calculate e and :
 Calculate det(C-I), yields a polynomial (degree n)
 Determine roots to det(C-I)=0, roots are eigenvalues 
Check out any math book such as
 Elementary Linear Algebra by Howard Anton, Publisher
John,Wiley & Sons
 Or any math packages such as MATLAB
22
Steps of PCA




Let X be the mean
vector (taking the mean
of all rows)
by the mean
 X’ = X – X
Compute the covariance
Find the eigenvectors
and eigenvalues of C.


For matrix C, vectors e
(=column vector) having
same direction as Ce :
 eigenvectors of C is e
such that Ce=e,
  is called an eigenvalue
of C.
Ce=e  (C-I)e=0

Most data mining
packages do this for
you.
23
Steps of PCA (cont.)

Calculate eigenvalues  and eigenvectors e for covariance matrix:
 Eigenvalues j corresponds to variance on each component j
 Thus, sort by j
 Take the first n eigenvectors ei; where n is the number of top
eigenvalues
 These are the directions with the largest variances
 y11   e1  x11  x1 

   
 y12   e2  x12  x2 

 ...    ... 
    ... 
 y   e  x  x 
 1n   n  1n
n
24
An Example
X1
X2
Mean1=24.1
Mean2=53.8
X1'
X2'
100
90
80
19
63
-5.1
9.25
70
60
50
Series1
40
39
74
14.9
20.25
30
20
10
30
87
5.9
33.25
30
23
5.9
-30.75
0
0
10
20
30
40
50
40
15
35
-9.1
30
-18.75
20
10
15
43
-9.1
-10.75
0
-15
15
32
-9.1
-21.75
-10
-5
-10
Series1
0
5
10
15
20
-20
-30
30
73
-40
5.9
19.25
25
Covariance Matrix


C=
75 106
106 482
Using MATLAB, we find out:
 Eigenvectors:
 e1=(-0.98,-0.21), 1=51.8
 e2=(0.21,-0.98), 2=560.2

Thus the second eigenvector is more important!
26
If we only keep one dimension: e2
yi
0.5


We keep the dimension of
e2=(0.21,-0.98)
We can obtain the final
data as
-40
-20
0.4
-10.14
0.3
-16.72
0.2
-31.35
0.1
0
31.374
-0.1 0
20
16.464
-0.2
8.624
-0.3
19.404
-0.4
-0.5
yi  xi1
40
-17.63
 0.21 
  0.21 * xi1  0.98 * xi 2
xi 2 
  0.98 
27
Using Matlab to figure it out
28
PCA in Weka
29
Wesather Data from UCI Dataset (comes
with weka package)
@relation weather
@attribute
@attribute
@attribute
@attribute
@attribute
outlook {sunny, overcast, rainy}
temperature real
humidity real
windy {TRUE, FALSE}
play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
30
PCA in Weka (I)
31
32
Summary of PCA



PCA is used for reducing the number of
numerical attributes
The key is in data transformation
 Find eigenvectors for covariance matrix
 Transform data
Note: only linear combination of data (weighted
sum of original data)
33
Missing and Inconsistent values

Linear regression: Data are
modeled to fit a straight line

least-square method to fit
Y=a+bX

Multiple regression: Y = b0 +
b1 X1 + b2 X2.

( X  X )(Y  Y )

b
(X  X )
2
a  Y  bX
Many nonlinear functions
can be transformed into
the above.
34
Regression
Height
Y1
Y1’
y=x+1
X1
Age
35
Clustering for
Outlier detection

Outliers can be incorrect data. Clusters 
majority behavior
36
Data Reduction with Sampling



Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Choose a representative subset of the data
 Simple random sampling may have very poor
performance in the presence of skew (uneven) classes
 Stratified sampling:
 Approximate the percentage of each class (or
subpopulation of interest) in the overall database
 Used in conjunction with skewed data
37
Sampling
Raw Data
38
Sampling Example
Cluster/Stratified Sample
Raw Data
39
Summary

Data preparation is a big issue for data mining

Data preparation includes

Data warehousing

Data reduction and feature selection

Discretization

Missing values

Incorrect values

Sampling
40
```