Automatically Building Special Purpose Search Engines with


Document Classification with
Naïve Bayes - How to Build Yahoo Automatically
Andrew McCallum
Just Research & CMU
www.cs.cmu.edu/~mccallum
Joint work with Kamal Nigam, Jason Rennie, Kristie Seymore,
Tom Mitchell, Sebastian Thrun, Roni Rosenfeld, Andrew Ng.
Document Classification

Testing Data:
  "planning language semantics proof intelligence..."  →  (Planning)

Categories and Training Data:
  ML:          learning, algorithm, reinforcement, intelligence, network, ...
  Planning:    planning, temporal, reasoning, plan, language, ...
  Semantics:   programming, semantics, types, language, proof, ...
  Garb.Coll.:  garbage, collection, memory, optimization, region, ...
  Multimedia:  ...
  GUI:         ...
6
A Probabilistic Approach to
Document Classification
Pick the most probable class, given the evidence:

    c = argmax_{c_j} Pr(c_j | d)

  c_j      - a class (like "Planning")
  d        - a document (like "language intelligence proof...")
  w_{d_i}  - the i-th word in d (like "proof")

Bayes Rule, plus the "Naïve Bayes" assumptions:

    Pr(c_j | d) = Pr(c_j) Pr(d | c_j) / Pr(d)
                = Pr(c_j) Π_{i=1..|d|} Pr(w_{d_i} | c_j)
                  / Σ_{c_k} Pr(c_k) Π_{i=1..|d|} Pr(w_{d_i} | c_k)

(1) One mixture component per class
(2) Independence assumption
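As a concrete illustration, the decision rule above can be sketched in a few lines of Python. The training documents and class names here are toy stand-ins, not the talk's data; the product is computed in log space to avoid underflow, and Laplace smoothing (introduced on a later slide) keeps unseen words from zeroing out a class:

```python
import math
from collections import Counter

# Hypothetical toy training data: (class, document tokens)
train = [
    ("Planning", "planning temporal reasoning plan language".split()),
    ("Semantics", "programming semantics types language proof".split()),
    ("ML", "learning algorithm reinforcement intelligence network".split()),
]

# Estimate Pr(c) and Pr(w|c) by counting.
class_counts = Counter(c for c, _ in train)
word_counts = {c: Counter() for c in class_counts}
for c, doc in train:
    word_counts[c].update(doc)
vocab = {w for _, doc in train for w in doc}

def classify(doc):
    # c = argmax_{c_j} Pr(c_j) * prod_i Pr(w_{d_i} | c_j), in log space
    best, best_score = None, float("-inf")
    for c in class_counts:
        score = math.log(class_counts[c] / len(train))
        total = sum(word_counts[c].values())
        for w in doc:
            # Laplace smoothing so unseen words don't zero the product
            score += math.log((1 + word_counts[c][w]) / (len(vocab) + total))
        if score > best_score:
            best, best_score = c, score
    return best

print(classify("language semantics proof".split()))  # → Semantics
```

With these toy counts, the test document from the earlier slide's scenario lands in the class sharing the most words with it, exactly as the argmax rule dictates.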
7
A Probabilistic Bayesian Approach
• Define a probabilistic generative model for
documents with classes.
• Learn the parameters of this model by fitting
them to the data and a prior.
    θ* = argmax_θ Pr(D | θ) Pr(θ)
8
Parameter Estimation in Naïve Bayes
Naïve Bayes:

    c = argmax_{c_j} Pr(c_j) Π_{i=1..|d|} Pr(w_{d_i} | c_j)

Maximum a posteriori estimate of Pr(w|c), with a Dirichlet prior
(AKA "Laplace smoothing"):

    P̂r(w_i | c_j) = (1 + Σ_{d_k ∈ c_j} N(w_i, d_k))
                    / (|V| + Σ_{t=1..|V|} Σ_{d_k ∈ c_j} N(w_t, d_k))

where N(w,d) is the number of times word w occurs in document d.
Two ways to improve this method:
(A) Make less restrictive assumptions about the model
(B) Get better estimates of the model parameters, i.e. Pr(w|c)
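The smoothed MAP estimate above can be checked numerically with a short sketch; the documents and vocabulary here are hypothetical toy data, and every token is assumed to be in the vocabulary:

```python
# Sketch of the MAP estimate of Pr(w|c) with Laplace smoothing.
# docs_in_class plays the role of the documents d_k in class c_j.
docs_in_class = [
    "learning algorithm learning network".split(),
    "reinforcement learning intelligence".split(),
]
vocab = ["learning", "algorithm", "network", "reinforcement",
         "intelligence", "planning"]

def map_estimate(word, docs, vocab):
    # Pr(w_i|c_j) = (1 + sum_k N(w_i,d_k)) / (|V| + sum_t sum_k N(w_t,d_k))
    n_word = sum(d.count(word) for d in docs)
    n_total = sum(len(d) for d in docs)  # total word occurrences in the class
    return (1 + n_word) / (len(vocab) + n_total)

print(map_estimate("learning", docs_in_class, vocab))  # 4/13
print(map_estimate("planning", docs_in_class, vocab))  # 1/13: unseen, yet nonzero
```

The second call shows why the Dirichlet prior matters: "planning" never occurs in the class, but its estimated probability stays strictly positive, so a single unseen word cannot veto a class.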
9
The Scenario
Training data with class labels:
  • Web pages the user says are interesting
  • Web pages the user says are uninteresting

Data available at training time, but without class labels:
  • Web pages the user hasn't seen or said anything about
Can we use the unlabeled documents to increase accuracy?
10
Using the Unlabeled Data
1. Build a classification model using the limited labeled data.
2. Use the model to estimate the labels of the unlabeled documents.
3. Use all documents to build a new classification model, which is often
   more accurate because it is trained using more data.
11
An Example
Labeled Data

  Baseball:
    • "The new hitter struck out..."
    • "Struck out in last inning..."
    • "Homerun in the first inning..."
    • "Pete Rose is not as good an athlete as Tara Lipinski..."

  Ice Skating:
    • "Fell on the ice..."
    • "Perfect triple jump..."
    • "Katarina Witt's gold medal performance..."
    • "New ice skates..."
    • "Practice at the ice rink every day..."

Unlabeled Data

  • "Tara Lipinski's substitute ice skates didn't hurt her performance.
    She graced the ice with a series of perfect jumps and won the gold
    medal."
  • "Tara Lipinski bought a new house for her parents."

Before EM:
  Pr(Lipinski) = 0.01
  Pr(Lipinski) = 0.001
After EM:
  Pr(Lipinski | Ice Skating) = 0.02
  Pr(Lipinski | Baseball) = 0.003
12
Filling in Missing Labels with EM
[Dempster et al. '77], [Ghahramani & Jordan '95], [McLachlan & Krishnan '97]
Expectation Maximization is a class of iterative
algorithms for maximum likelihood estimation with
incomplete data.
• E-step: Use current estimates of model parameters
to “guess” value of missing labels.
• M-step: Use current “guesses” for missing labels to
calculate new estimates of model parameters.
• Repeat E- and M-steps until convergence.
Finds the model parameters that locally maximize the
probability of both the labeled and the unlabeled data.
13
EM for Text Classification
Expectation-step (estimate the class labels)
    Pr(c_j | d) ∝ Pr(c_j) Π_{i=1..|d|} Pr(w_{d_i} | c_j)

Maximization-step (new parameters using the estimates):

    P̂r(w_i | c_j) = (1 + Σ_{d_k} N(w_i, d_k) Pr(c_j | d_k))
                    / (|V| + Σ_{t=1..|V|} Σ_{d_k} N(w_t, d_k) Pr(c_j | d_k))
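Putting the E- and M-steps together, here is a compact sketch of semi-supervised naïve Bayes with EM. All documents, classes, and counts below are illustrative toy data, not from the experiments; labeled documents keep weight 1 on their true class, while unlabeled documents carry the soft posteriors from the E-step:

```python
import math
from collections import Counter

labeled = [("Baseball", "hitter struck out inning".split()),
           ("IceSkating", "ice skates jump medal".split())]
unlabeled = ["struck out in the inning".split(),
             "perfect jump on the ice".split()]
classes = ["Baseball", "IceSkating"]
vocab = sorted({w for _, d in labeled for w in d} |
               {w for d in unlabeled for w in d})

def train(doc_weights):
    # doc_weights: list of (doc, {class: weight}); labeled docs have weight 1.
    prior, wc = Counter(), {c: Counter() for c in classes}
    for doc, weights in doc_weights:
        for c, p in weights.items():
            prior[c] += p
            for w in doc:
                wc[c][w] += p
    return prior, wc

def posterior(doc, prior, wc):
    # E-step: Pr(c|d) ∝ Pr(c) * prod_i Pr(w_i|c), with Laplace smoothing
    logs = {}
    total_prior = sum(prior.values())
    for c in classes:
        lp = math.log(prior[c] / total_prior)
        tot = sum(wc[c].values())
        for w in doc:
            lp += math.log((1 + wc[c][w]) / (len(vocab) + tot))
        logs[c] = lp
    m = max(logs.values())
    exps = {c: math.exp(l - m) for c, l in logs.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

# Start from the labeled data only, then iterate E- and M-steps.
data = [(d, {c: 1.0}) for c, d in labeled]
prior, wc = train(data)
for _ in range(5):
    soft = [(d, posterior(d, prior, wc)) for d in unlabeled]   # E-step
    prior, wc = train(data + soft)                             # M-step

print(posterior("struck out".split(), prior, wc))
```

Each iteration re-estimates Pr(w|c) from fractional counts weighted by Pr(c_j | d_k), exactly the M-step formula above, so the loop climbs toward a local maximum of the joint likelihood of labeled and unlabeled data.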
14
WebKB Data Set
4 classes: student, faculty, course, project
4,199 documents from CS academic departments
15
Word Vector Evolution with EM
Iteration 0      Iteration 1      Iteration 2
intelligence     DD               D
DD               D                DD
artificial       lecture          lecture
understanding    cc               cc
DDw              D*               DD:DD
dist             DD:DD            due
identical        handout          D*
rus              due              homework
arrange          problem          assignment
games            set              handout
dartmouth        tay              set
natural          DDam             hw
cognitive        yurtas           exam
logic            homework         problem
proving          kfoury           DDam
prolog           sec              postscript

(D is a digit)
16
EM as Clustering
[Figure: clusters of documents; X = unlabeled]
17
EM as Clustering, Gone Wrong
[Figure: clusters of documents; X = unlabeled]
18
20 Newsgroups Data Set
…
20 class labels, 20,000 documents
62k unique words
19
Newsgroups Classification Accuracy
varying # labeled documents
[gnuplot figure omitted in transcript]
20
Newsgroups Classification Accuracy
varying # unlabeled documents
[gnuplot figure omitted in transcript]
21
WebKB Classification Accuracy
varying # labeled documents
[gnuplot figure omitted in transcript]
22
WebKB Classification Accuracy
varying weight of unlabeled data
[gnuplot figure omitted in transcript]
23
WebKB Classification Accuracy
varying # labeled documents
and selecting unlabeled weight by CV
[gnuplot figure omitted in transcript]
24
Populating a hierarchy
• Naïve Bayes
+ Simple, robust document classification.
+ Many principled enhancements (e.g. shrinkage).
– Requires some labeled training data.
• Keyword matching
+ Requires no labeled training data except
keywords themselves.
– Brittle; breaks easily.
25
Combine Naïve Bayes and
Keywords for Best of Both
• Classify unlabeled documents with keyword
matching.
• Pretend these category labels are correct,
and use this data to train naïve Bayes.
• Naïve Bayes acts to temper and “round out” the
keyword class definitions.
• Brings in new probabilistically-weighted keywords
that are correlated with the few original keywords.
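A minimal sketch of the first two steps, with made-up keyword lists and documents: keyword matching assigns pseudo-labels, and the pseudo-labeled documents would then feed naïve Bayes training as on the earlier slides:

```python
# Hypothetical sketch: keyword matching provides pseudo-labels for
# training naive Bayes (class names and keywords are illustrative).
keywords = {"Baseball": {"inning", "hitter"},
            "IceSkating": {"ice", "skates"}}

def keyword_label(doc):
    # Assign the class whose keyword set overlaps the document most;
    # None if no keyword matches (document left out of training).
    scores = {c: len(kw & set(doc)) for c, kw in keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

docs = ["the hitter struck out in the ninth inning".split(),
        "new ice skates for the gold medal skater".split(),
        "weather was nice".split()]

pseudo_labeled = [(keyword_label(d), d) for d in docs if keyword_label(d)]
print([c for c, _ in pseudo_labeled])  # → ['Baseball', 'IceSkating']
```

Training naïve Bayes on these pseudo-labels then pulls in correlated words ("struck", "medal", ...) that the handful of original keywords never mentioned, which is what "rounds out" the brittle keyword class definitions.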
26
Top words found by naïve Bayes and Shrinkage
ROOT: computer, university, science, system, paper

  Programming: programming, language, logic, university, programs
    Semantics:          semantics, denotational, language, construction, types
    Garbage Collection: garbage, collection, memory, optimization, region

  AI: learning, university, computer, based, intelligence
    NLP:              language, natural, processing, information, text
    Machine Learning: learning, algorithm, university, networks, problems
    Planning:         planning, temporal, reasoning, plan

  Hardware: circuits, designs, computer, university, performance

  HCI: computer, system, multimedia, university, paper
    GUI:         interface, design, user, sketch, interfaces
    Multimedia:  multimedia, real, time, data, media
    Cooperative: collaborative, CSCW, work, provide, group

  IR: information, text, documents, classification, retrieval
27
Classification Results
400 test documents
70 classes in a hierarchy of depth 2-4

Method     # Labeled   # Pseudo-Labeled   # Unlabeled   Accuracy
Keyword         -              -               -           45%
NB            100              -               -           30%
NB            399              -               -           47%
NB              -         12,657               -           47%
NB+S            -         12,657               -           63%
NB+EM+S         -         12,657          18,025           66%
Human           -              -               -           72%
28
Conclusions
• Naïve Bayes is a method of document classification based on Bayesian
  statistics.
• It has many parameters to estimate, and therefore requires much
  labeled training data.
• We can build on its probabilistic, statistical foundations to improve
  performance (e.g. unlabeled data + EM).
• These techniques are accurate and robust enough to build useful Web
  services.
29