
Introduction to Computational Natural
Language Learning
Linguistics 79400 (Under: Topics in Natural Language Processing)
Computer Science 83000 (Under: Topics in Artificial Intelligence)
The Graduate School of the City University of New York
Fall 2001
William Gregory Sakas
Hunter College, Department of Computer Science
Graduate Center, PhD Programs in Computer Science and Linguistics
The City University of New York
Suppose we have a single search word "egg."
All documents in our corpus (e.g. web pages) are organized into the
following categories: dishwasher, poultry, and pregnancy.
What is the likelihood that the keyword is intended to "key into" each of
the 3 categories of documents? That is, which category would be the
best prediction for a search engine to make?
The prediction can easily be based on word frequencies in a bag-of-words approach.
Say, for example, that the word "egg" appears:
  379 times in documents classified as dishwasher related
  1,617 times in documents classified as poultry related
  824 times in documents classified as pregnancy related
Clearly, without any other words, poultry would be the best prediction.
Let's formalize this a bit: we want argmax_c p(c | "egg"),
that is, the category c that maximizes the probability of c given "egg."
Take an example. What's p(poultry | "egg")?
Take all occurrences of "egg" from all documents in our collection (in our
example that would be 2,820 = 379 + 1,617 + 824) and partition them into
their categories.
"egg"
1,617
379
# occurrences of "egg"
in documents in
dishwasher category
834
# occurrences of "egg"
in documents in
pregnancy category
# occurrences of "egg"
in documents in poultry
category
3
p(dishwasher | "egg") = 379/2,820
p(poultry | "egg")    = 1,617/2,820
p(pregnancy | "egg")  = 824/2,820
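For concreteness, those fractions work out to roughly:

p(dishwasher | "egg") = 379/2,820   ≈ 0.134
p(poultry | "egg")    = 1,617/2,820 ≈ 0.573
p(pregnancy | "egg")  = 824/2,820   ≈ 0.292

and the three sum to 1, since every occurrence of "egg" falls into exactly one category.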
In fact, since the denominator is the same for every category, when calculating
the argmax we can just drop it and simply take the category with the most
occurrences of "egg." That is, we want: argmax_c count("egg" in category c).
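As a quick sketch in Python (using the counts from the slides), dropping the shared denominator really does leave the argmax unchanged:

```python
# A minimal sketch: the category that maximizes p(c | "egg") is the same one
# that maximizes the raw count, because the shared denominator 2,820 does not
# affect which category wins the argmax.
counts = {"dishwasher": 379, "poultry": 1617, "pregnancy": 824}
total = sum(counts.values())  # 2,820

best_by_probability = max(counts, key=lambda c: counts[c] / total)
best_by_count = max(counts, key=lambda c: counts[c])
assert best_by_probability == best_by_count == "poultry"
```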
Unfortunately, calculating this is quite expensive. We have to go through
EVERY document in every category. So instead we apply Bayes' Rule:
p(A | B) = p(B | A) p(A) / p(B)
or in our example:
p(category | word) = p(word | category) p(category) / p(word)
But since we are finding the maximum probability, we can drop the
denominator:

argmax_{c ∈ categories} p(c | word) = argmax_{c ∈ categories} p(word | c) p(c)
It is easy to extend from a single word to multiple words, and we get the
basic version of the NAIVE BAYES algorithm:

(1)  argmax_{c ∈ categories} p(words | c) p(c)
p(c) is simply the probability of category c being chosen, independent of
any words. It can be estimated, for example, by the formula:

    total number of words in all documents categorized as c
    --------------------------------------------------------
         total number of words in the entire corpus
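Here is a minimal Python sketch of that estimate, assuming the corpus is just a hypothetical list of (category, text) pairs (the names and toy documents below are illustrative, not from the slides):

```python
# A minimal sketch of the prior p(c): the share of all corpus words that
# sit in documents categorized as c. The toy corpus is hypothetical.
corpus = [
    ("poultry",    "how long should you boil an egg"),
    ("poultry",    "egg laying cycles in backyard hens"),
    ("pregnancy",  "a week by week pregnancy guide"),
    ("dishwasher", "dishwasher rinse aid and cycle settings"),
]

def prior(category, corpus):
    words_in_c  = sum(len(text.split()) for c, text in corpus if c == category)
    total_words = sum(len(text.split()) for _, text in corpus)
    return words_in_c / total_words

print(prior("poultry", corpus))  # fraction of all corpus words that live in poultry documents
```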
(BTW, why is (1) easier to compute than argmax_{c ∈ categories} p(c | words)?
Because in order to compute the second equation we would need to compute 2^n entries, where n =
number of words, to obtain a joint probability distribution. See next slide. The more computery
students, see me after class or email me if interested in further discussion.)
"egg"
1,617
824
379
p(Poultry | "egg") =
1,617 /
(379 + 1,617 + 824)
Dishwashers
Poultry
Pregnancy
6
In any event, the best predictor of a category, given a bunch of words, is given
by (1) above.
A final equation. If words contains the words w1, w2, . . . , wn,
then (1) can be rewritten, assuming the words are independently likely to
occur (pretty naive, but it works fairly well in practice), as:

(2)  argmax_{c ∈ categories} [ p(w1 | c) * p(w2 | c) * . . . * p(wn | c) ] * p(c)
And this is the way the implementation works.
A) A corpus of documents is fed to the learner; the words are counted up
and stored so that the probabilities in (2) can be effectively calculated.
B) An unseen document is given to the learner, (2) is calculated where
w1, w2, . . . , wn are the words in the document, and the category that maximizes
(2) is returned by the learner.
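Here is a minimal Python sketch of steps A) and B). The function names and the toy corpus are illustrative, and it departs from (2) in two standard ways: it sums log probabilities instead of multiplying (which leaves the argmax unchanged), and it applies add-one smoothing so an unseen word does not zero out a whole category.

```python
# A minimal sketch of steps A) and B), assuming documents arrive as
# (category, text) pairs and words are split on whitespace.
from collections import Counter, defaultdict
import math

def train(corpus):
    """Step A: count words per category so the probabilities in (2) can be looked up."""
    word_counts = defaultdict(Counter)  # word_counts[c][w] = how often w appears in category c
    cat_totals = Counter()              # total number of words seen in each category
    for category, text in corpus:
        for word in text.lower().split():
            word_counts[category][word] += 1
            cat_totals[category] += 1
    return word_counts, cat_totals

def classify(text, word_counts, cat_totals):
    """Step B: return the category that maximizes equation (2) for an unseen document."""
    grand_total = sum(cat_totals.values())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    best_category, best_score = None, float("-inf")
    for c in cat_totals:
        # p(c) as estimated above: share of all corpus words that live in category c.
        score = math.log(cat_totals[c] / grand_total)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing out the whole product.
            p_word_given_c = (word_counts[c][word] + 1) / (cat_totals[c] + len(vocabulary))
            score += math.log(p_word_given_c)  # summing logs == log of the product in (2)
        if score > best_score:
            best_category, best_score = c, score
    return best_category

corpus = [
    ("poultry",    "egg laying hens need extra calcium"),
    ("pregnancy",  "morning sickness is common in early pregnancy"),
    ("dishwasher", "the dishwasher rinse cycle uses hot water"),
]
word_counts, cat_totals = train(corpus)
print(classify("how often do hens lay an egg", word_counts, cat_totals))  # expected: poultry
```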
The vector space model.
         speech   language   processing
Doc 1       6         0           1
Doc 2       0         5           1
Doc 3       1         2           1
Shorter notation:
Doc 1 = <6,0,1>
Doc 2 = <0,5,1>
Doc 3 = <1,2,1>
But we need to normalize. For example, <1,2,1> should be considered very
similar to <3,6,3>. This is easy to do.
[Diagram: Doc 1 = <6,0,1>, Doc 2 = <0,5,1>, and Doc 3 = <1,2,1> plotted as vectors, with the "speech" and "processing" counts as two of the axes]
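The slide does not spell out how to normalize; one common choice is to compare documents by the cosine of the angle between their count vectors, which ignores vector length, as in this short Python sketch:

```python
# A minimal sketch of cosine similarity: vectors pointing in the same
# direction score 1.0 regardless of length, so <1,2,1> and <3,6,3> match.
import math

def cosine(u, v):
    dot    = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc1, doc2, doc3 = (6, 0, 1), (0, 5, 1), (1, 2, 1)
print(cosine((1, 2, 1), (3, 6, 3)))  # 1.0 -- identical direction, maximally similar
print(cosine(doc1, doc2))            # about 0.03 -- these documents share little
```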