
INFS 795
Dr. Domeniconi
Comparison of Principal Component Analysis and Random Projection in Text Mining
Steve Vincent
April 29, 2004
Outline
Introduction
Previous Work
Objective
Background on Principal Component Analysis (PCA) and Random Projection (RP)
Test Data Sets
Experimental Design
Experimental Results
Future Work
Introduction
“Random projection in dimensionality reduction: Applications to image and text data” from KDD 2001, by Bingham and Mannila, compared principal component analysis (PCA) to random projection (RP) for text and image data
For future work, they said: “A still more realistic application of random projection would be to use it in a data mining problem”
Previous Work
In 2001, Bingham and Mannila compared PCA to RP for images and text
In 2001, Torkkola discussed both Latent Semantic Indexing (LSI) and RP in classifying text at very low dimension levels
 LSI is very similar to PCA for text data
 Used the Reuters-21578 database
In 2003, Fradkin and Madigan discussed the background of RP
In 2003, Lin and Gunopulos combined LSI with RP
No real data mining comparison between PCA and RP has been made
Objective
Principal Component Analysis (PCA):
 Finds components that make the projections uncorrelated by selecting the eigenvectors with the highest eigenvalues of the covariance matrix
 Maximizes retained variance
Random Projection (RP):
 Reduces dimensionality by multiplying the data by a random matrix
 Minimizes computation for a particular dimension size
Determine whether RP is a viable dimensionality reduction method
Principal Component Analysis
Normalize the input data, then center it by subtracting the mean; the result is X, used below
Compute the global mean and the covariance matrix of X:

Covariance: $\Sigma = \frac{1}{N}\sum_{n=1}^{N}(x_n - \bar{x})(x_n - \bar{x})^{T}$
Compute the eigenvalues and eigenvectors of the covariance matrix
Arrange the eigenvectors in order of the magnitude of their eigenvalues; take the first d eigenvectors as the principal components
Put the d eigenvectors as columns in a matrix M
Determine the reduced output E by multiplying X by M: E = XM
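A minimal NumPy sketch of these PCA steps (the reductions in this work were done in Matlab; the function and variable names here are illustrative only):

```python
import numpy as np

def pca_reduce(X, d):
    """Reduce an n x p data matrix X to n x d using PCA."""
    # Center the data by subtracting the per-attribute mean.
    Xc = X - X.mean(axis=0)
    # Covariance matrix (p x p) with 1/N normalization.
    cov = np.cov(Xc, rowvar=False, bias=True)
    # Eigendecomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; take the top d.
    order = np.argsort(eigvals)[::-1][:d]
    M = eigvecs[:, order]   # p x d: principal components as columns
    return Xc @ M           # reduced output E = X M (n x d)

# Example: 100 samples with 58 attributes down to 10 dimensions.
E = pca_reduce(np.random.rand(100, 58), 10)
print(E.shape)  # (100, 10)
```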
Random Projection
With X being an n × p matrix, calculate E using:

$E = \frac{1}{\sqrt{q}} X P$

with projection matrix P, where q is the number of reduced dimensions
P, p × q, is a matrix with elements $r_{ij}$
 $r_{ij}$ = random Gaussian
P can also be constructed in one of the following ways:
 $r_{ij} = \pm 1$ with probability 0.5 each
 $r_{ij} = \sqrt{3}\,(\pm 1)$ with probability 1/6 each, or 0 with probability 2/3
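A minimal NumPy sketch of RP with the three projection-matrix constructions above (names are illustrative, and the $1/\sqrt{q}$ scaling follows the reconstructed formula; the Matlab implementation used in the experiments may differ):

```python
import numpy as np

def random_projection(X, q, kind="gaussian", seed=None):
    """Project an n x p matrix X to n x q via E = (1/sqrt(q)) X P."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    if kind == "gaussian":
        # r_ij drawn from a standard Gaussian.
        P = rng.standard_normal((p, q))
    elif kind == "sign":
        # r_ij = +1 or -1, each with probability 0.5.
        P = rng.choice([-1.0, 1.0], size=(p, q))
    elif kind == "sparse":
        # r_ij = sqrt(3)*(+/-1) with probability 1/6 each, 0 with 2/3.
        P = np.sqrt(3) * rng.choice([-1.0, 0.0, 1.0], size=(p, q),
                                    p=[1 / 6, 2 / 3, 1 / 6])
    else:
        raise ValueError(kind)
    return (X @ P) / np.sqrt(q)

# Example: 2,903 words down to 50 dimensions.
E = random_projection(np.random.rand(100, 2903), 50, kind="sparse")
print(E.shape)  # (100, 50)
```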
SPAM Email Database
SPAM E-mail Database, generated June/July 1999
Determine whether an email is spam or not
 Previous tests have produced a 7% misclassification error
Source of data: http://www.ics.uci.edu/~mlearn/MLRepository.html
Number of instances: 4,601 (1,813 spam = 39.4%)
SPAM Email Database
Number of attributes: 58
Attributes:
 48 attributes = word frequency
 6 attributes = character frequency
 1 attribute = average length of uninterrupted sequences of capital letters
 1 attribute = longest uninterrupted sequence of capital letters
 1 attribute = sum of the lengths of uninterrupted sequences of capital letters
 1 attribute = class label (1 = spam, 0 = not spam)
Yahoo News Categories
Introduced in “Impact of Similarity Measures on Web-Page Clustering” by Alexander Strehl, et al.
 Located at: ftp://ftp.cs.umn.edu/dept/users/boley/PDDPdata/
Data consists of 2,340 documents in 20 Yahoo news categories
After stemming, the database consists of 21,839 words
Strehl was able to reduce the number of words to 2,903 by selecting only those words that appear in 1% to 10% of all articles
Yahoo News Categories
Number of documents in each category:

Category            No.     Category         No.
Business            142     E: Online         65
Entertainment (E)     9     E: People        248
E: Art               24     E: Review        158
E: Cable             44     E: Stage          18
E: Culture           74     E: Television    187
E: Film             278     E: Variety        54
E: Industry          70     Health           494
E: Media             21     Politics         114
E: Multimedia        14     Sports           141
E: Music            125     Technology        60

2,340 total
Revised Yahoo News Categories
Combined the 15 Entertainment categories into one category:

Category                  No.
Business                  142
Entertainment (Total)   1,389
Health                    494
Politics                  114
Sports                    141
Technology                 60
Yahoo News Characteristics
With the various simplifications and revisions, the Yahoo News database has the following characteristics:
 2,340 documents
 2,903 words
 6 categories
Even with these simplifications and revisions, there are still too many attributes to do effective data mining
Experimental Design
Perform PCA and RP on each data set for a wide range of dimension numbers
 Run RP multiple times due to the random nature of the algorithm
 Determine relative times for each reduction
Compare PCA and RP results in various data mining techniques
 Techniques include Naïve Bayes, Nearest Neighbor, and Decision Trees
 Determine relative times for each technique
Compare PCA and RP on time and accuracy
Retained Variance
Retained variance (r) is the percentage of the original variance that the PCA-reduced data set covers; the equation is:

$$r = \frac{\sum_{i=1}^{d} l_i}{\sum_{i=1}^{m} l_i} \times 100\%$$

where the $l_i$ are the eigenvalues, m is the original number of dimensions, and d is the reduced number of dimensions.
In many applications, r should be above 90%
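A short sketch of the retained-variance computation, assuming the eigenvalues come from the PCA step above (names are illustrative):

```python
import numpy as np

def retained_variance(eigvals, d):
    """Percent of original variance kept by the top d eigenvalues."""
    l = np.sort(np.asarray(eigvals))[::-1]  # descending l_i
    return l[:d].sum() / l.sum() * 100.0

# Example with made-up eigenvalues: the top 2 of 5 retain 80%.
print(retained_variance([5.0, 3.0, 1.5, 0.4, 0.1], 2))  # 80.0
```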
Retained Variance Percent
SPAM Database                       Yahoo News Database

Dim #   Retained Variance (%)       Dim #   Retained Variance (%)
5       26.4                        25      10.92
10      38.3                        50      16.95
15      48.2                        100     26.44
20      57.3                        150     34.28
25      65.7                        200     41.06
30      73.6                        250     47.04
35      80.6                        300     52.37
40      87.0                        600     75.22
45      92.5
50      96.8
PCA and RP Time Comparison
SPAM Database (times in seconds; RP run 5 times for each dimension; PCA/RP is the time of PCA divided by the average time of RP)

Dim #   PCA Time   Avg RP Time   PCA/RP
5       1.3        0.057         23.0
10      1.2        0.075         16.1
15      1.4        0.081         17.1
20      1.4        0.088         15.7
25      1.2        0.100         11.7
30      1.4        0.109         13.2
35      1.2        0.125          9.4
40      1.2        0.126          9.4
45      1.2        0.141          8.6
50      1.2        0.155          8.0

RP averages over 10 times faster than PCA
Reduction performed in Matlab on a Pentium III 1 GHz computer with 256 MB RAM
PCA and RP Time Comparison
Yahoo News Database (times in seconds; RP run 5 times for each dimension; PCA/RP is the time of PCA divided by the average time of RP)

Dim #   PCA Time   Avg RP Time   PCA/RP
25      2404        6.02         399
50      2223        7.66         290
100     2342       11.34         207
150     2397       14.09         170
200     2378       16.95         140
250     2685       18.55         145
300     2489       23.80         105
600     2551       33.74          76

RP averages over 100 times faster than PCA
Reduction performed in Matlab on a Pentium III 1 GHz computer with 256 MB RAM
Data Mining
Explored various data mining techniques using the Weka software package. The following produced the best results:
 IB1: Nearest Neighbor
 J48: Decision Trees
The following produced poor results and will not be used:
 Naïve Bayes: overall poor results
 SVM (SMO): too slow, with results similar to the others
Data Mining Procedures
For each data set imported into Weka:
 Convert the numerical categories to nominal values
 Randomize the order of the entries
 Run J48 and IB1 on the data
 Determine % correct and check F-measure statistics
Ran PCA once for each dimension number and RP 5 times for each dimension number
 Used a 67% training / 33% testing split
 Tested on 1,564 instances for SPAM and 796 for Yahoo
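The experiments themselves were run in Weka; the following is a rough Python analogue of this procedure using scikit-learn stand-ins (DecisionTreeClassifier for J48, 1-nearest-neighbor for IB1). Only the 67/33 split and the reported statistics come from the slides; everything else is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

def evaluate(E, y, seed=0):
    """Evaluate a reduced data set E with class labels y."""
    # 67% training / 33% testing split, with randomized order.
    X_tr, X_te, y_tr, y_te = train_test_split(
        E, y, test_size=0.33, shuffle=True, random_state=seed)
    for name, clf in [("tree (J48-like)", DecisionTreeClassifier()),
                      ("1-NN (IB1-like)", KNeighborsClassifier(n_neighbors=1))]:
        pred = clf.fit(X_tr, y_tr).predict(X_te)
        print(f"{name}: {100 * accuracy_score(y_te, pred):.3f}% correct, "
              f"F-measure {f1_score(y_te, pred, average='weighted'):.3f}")

# Example with placeholder data in place of a PCA- or RP-reduced set.
evaluate(np.random.rand(300, 10), np.random.randint(0, 2, 300))
```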
Results-J48 Spam Data
Percent correct:

# Dim   PCA      RP Max   RP Min   Full Set
5       90.601   76.726   73.210
10      89.578   76.343   72.251
15      90.921   75.000   72.187
20      89.962   76.662   74.552
25      89.578   75.064   74.489
30      89.962   79.540   74.744
35      88.619   75.831   74.425
40      88.875   77.749   74.041
45      90.345   79.220   75.512
58                                 91.432

• PCA gave uniformly good results for all dimension levels
• PCA gave results comparable to the 91.4% correct for the full data set
• RP was 15% below the full data set results
Results-J48 Spam Data
[Chart: % correct vs. dimension # (5-45) for RP Max and RP Min]

RP gave consistent results, with a very small split between maximum and minimum values
Results-IB1 Spam Data
Percent correct:

# Dim   PCA      RP Max   RP Min   Full Set
5       89.450   81.522   78.772
10      91.113   79.987   77.558
15      90.601   81.522   78.069
20      89.450   80.499   77.110
25      89.386   80.755   79.156
30      89.642   81.841   78.900
35      91.049   82.481   79.923
40      90.154   81.266   79.412
45      89.067   81.905   79.028
58                                 89.450

• PCA gave uniformly good results for all dimension levels
• PCA gave results comparable to the 89.5% correct for the full data set
• RP was 10% below the full data set results
Results-IB1 Spam Data
[Chart: % correct vs. dimension # (5-45) for RP Max and RP Min]

RP gave consistent results, with a very small split between maximum and minimum values
Results SPAM Data
PCA gave consistent results for all dimension levels
 Expected the lower dimension levels not to perform as well
RP gave consistent, but lower, results for all dimension levels
 Also expected the lower dimension levels not to perform as well
Results-J48 Yahoo Data
Percent correct:

# Dim   PCA      RP Max   RP Min
25      90.452   56.030   52.136
50      92.085   59.548   54.146
100     89.824   60.553   58.668
150     90.452   58.668   56.281
200     89.322   58.920   58.920
250     91.332   58.794   58.417
300     90.327   58.417   56.658
600     90.201   58.794   58.417

• PCA gave uniformly good results for all dimension levels
• RP was over 30% below the PCA results

Note: Did not run data mining on the full data set due to the large dimension number
Results-J48 Yahoo Data
[Chart: % correct vs. dimension # (25-600) for PCA, RP Max, and RP Min]

RP gave consistent results, with a very small split between maximum and minimum values
RP results were much lower than PCA
Results-IB1 Yahoo Data
Percent correct:

# Dim   PCA      RP Max   RP Min
25      93.216   73.869   70.603
50      94.347   79.987   76.759
100     90.201   82.663   81.658
150     87.814   80.025   79.146
200     87.312   82.035   80.653
250     86.558   82.789   79.271
300     84.422   82.035   78.895
600     79.900   82.789   82.035

• PCA percent correct decreased as the dimension number increased
• RP was 20% below PCA at low dimension numbers, decreasing to 0% at high dimension numbers

Note: Did not run data mining on the full data set due to the large dimension number
Results-IB1 Yahoo Data
[Chart: % correct vs. dimension # (25-600) for PCA, RP Max, and RP Min]

RP gave consistent results, with a very small split between maximum and minimum values
RP results were similar to PCA at high dimension levels
Results Yahoo Data
PCA showed consistently high results for the Decision Tree output, but decreasing results at higher dimensions for the Nearest Neighbor output
 Could be overfitting in the Nearest Neighbor case
 Decision Tree has pruning to prevent overfitting
Results Yahoo Data
RP showed consistent results for both Nearest Neighbor and Decision Trees
 The lower dimension numbers gave slightly lower results
  Approximately 10-20% lower for dimension numbers less than 100
 The Nearest Neighbor results were 20% higher than the Decision Tree results
Overall Results
RP gives consistent results over multiple runs, with few inconsistencies
In general, RP is one to two orders of magnitude (10 to 100 times) faster than PCA, but in most cases produced lower accuracy
The RP results are closer to PCA when using the Nearest Neighbor data mining technique
Would suggest using RP if speed of processing is most important
Future Work
Need to examine additional data sets to determine whether the results are consistent
Both PCA and RP are linear tools: they map the original data set using a linear mapping
Examine deriving PCA using SVD for speed
A more general comparison would include non-linear dimensionality reduction methods such as:
 Kernel PCA
 SVM
References
E. Bingham and H. Mannila, “Random projection in dimensionality reduction: Applications to image and text data”, KDD 2001.
D. Fradkin and D. Madigan, “Experiments with Random Projections for Machine Learning”, SIGKDD ’03, August 2003.
J. Lin and D. Gunopulos, “Dimensionality Reduction by Random Projection and Latent Semantic Indexing”, Proceedings of the Text Mining Workshop at the 3rd SIAM International Conference on Data Mining, May 2003.
K. Torkkola, “Linear Discriminant Analysis in Document Classification”, IEEE Workshop on Text Mining (TextDM ’2001), November 2001.
Questions?