
CSE 446: Naïve Bayes
Winter 2012
Dan Weld
Some slides from Carlos Guestrin, Luke Zettlemoyer & Dan Klein
Today
• Gaussians
• Naïve Bayes
• Text Classification
Long Ago
• Random variables, distributions
• Marginal, joint & conditional probabilities
• Sum rule, product rule, Bayes rule
• Independence, conditional independence
Last Time
                                  Prior      Hypothesis
Maximum Likelihood Estimate       Uniform    The most likely
Maximum A Posteriori Estimate     Any        The most likely
Bayesian Estimate                 Any        Weighted combination
Bayesian Learning
Use Bayes rule:
    P(θ | D) = P(D | θ) P(θ) / P(D)
    posterior = (data likelihood × prior) / normalization
Or equivalently:
    P(θ | D) ∝ P(D | θ) P(θ)
Conjugate Priors?
Those Pesky Distributions
                        Discrete                       Continuous
                        Binary {0, 1}   M Values
Single Event            Bernoulli
Sequence (N trials)     Binomial        Multinomial
Conjugate Prior         Beta            Dirichlet
(N = H + T)
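A minimal sketch (not from the slides) of how a conjugate prior from the table behaves for coin flips: a Beta(α, β) prior on P(heads), combined with H heads and T tails, gives a Beta(α + H, β + T) posterior, so the update is just addition of counts. The function names are illustrative.

```python
# Minimal sketch: conjugate Beta/Bernoulli updates for coin flips.
# Posterior after H heads and T tails under a Beta(alpha, beta) prior
# is Beta(alpha + H, beta + T), so point estimates come out in closed form.

def beta_posterior(alpha, beta, heads, tails):
    """Return posterior Beta parameters after observing the flips."""
    return alpha + heads, beta + tails

def posterior_mean(alpha, beta):
    return alpha / (alpha + beta)

def map_estimate(alpha, beta):
    # Mode of a Beta distribution (requires alpha, beta > 1).
    return (alpha - 1) / (alpha + beta - 2)

if __name__ == "__main__":
    a, b = beta_posterior(alpha=2, beta=2, heads=7, tails=3)   # N = H + T = 10
    print(posterior_mean(a, b))   # 0.6428..., pulled toward the prior mean 0.5
    print(map_estimate(a, b))     # 0.6666... = (7 + 1) / (10 + 2)
```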
What about continuous variables?
• Billionaire says: If I am measuring a continuous variable, what can you do for me?
• You say: Let me tell you about Gaussians…
Some properties of Gaussians
• Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian
  – X ~ N(μ, σ²)
  – Y = aX + b  ⇒  Y ~ N(aμ + b, a²σ²)
• Sum of independent Gaussians is Gaussian
  – X ~ N(μX, σ²X)
  – Y ~ N(μY, σ²Y)
  – Z = X + Y  ⇒  Z ~ N(μX + μY, σ²X + σ²Y)
• Easy to differentiate, as we will see soon!
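A quick Monte-Carlo sanity check of these two properties; this is an illustrative sketch, not part of the lecture, and the specific means and variances are made up.

```python
# Check numerically that an affine transform of a Gaussian has mean a*mu + b
# and variance a^2 * sigma^2, and that the sum of independent Gaussians adds
# means and variances.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x = rng.normal(loc=2.0, scale=3.0, size=n)      # X ~ N(2, 3^2)
y = 5.0 * x + 1.0                               # Y = 5X + 1
print(y.mean(), y.var())                        # ~ 11.0 and ~ 225.0 = 5^2 * 9

w = rng.normal(loc=-1.0, scale=2.0, size=n)     # W ~ N(-1, 2^2), independent of X
z = x + w                                       # Z = X + W
print(z.mean(), z.var())                        # ~ 1.0 and ~ 13.0 = 9 + 4
```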
Learning a Gaussian
• Collect a bunch of data
– Hopefully, i.i.d. samples
– e.g., exam scores
• Learn parameters
– Mean: μ
– Variance: σ²
i     Xi = Exam Score
0     85
1     95
2     100
3     12
…     …
99    89
MLE for Gaussian:
• Prob. of i.i.d. samples D = {x1,…,xN}:
    P(D | μ, σ) = ∏i 1/(σ√(2π)) · exp(−(xi − μ)² / (2σ²))
• Log-likelihood of data:
    ln P(D | μ, σ) = −N ln(σ√(2π)) − Σi (xi − μ)² / (2σ²)
Your second learning algorithm:
MLE for mean of a Gaussian
• What’s MLE for mean? Set the derivative of the log-likelihood to zero:
    μMLE = (1/N) Σi xi
MLE for variance
• Again, set derivative to zero:
    σ²MLE = (1/N) Σi (xi − μMLE)²
Learning Gaussian parameters
• MLE:
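A minimal numpy sketch of the MLE formulas above, run on a few of the (hypothetical) exam scores from the earlier slide.

```python
# mu_MLE is the sample mean; sigma^2_MLE is the (biased, 1/N) sample variance.
import numpy as np

scores = np.array([85, 95, 100, 12, 89], dtype=float)   # a few of the Xi values

mu_mle = scores.mean()                          # (1/N) * sum_i x_i
var_mle = np.mean((scores - mu_mle) ** 2)       # (1/N) * sum_i (x_i - mu)^2
print(mu_mle, var_mle)                          # same as np.var(scores, ddof=0)
```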
Bayesian learning of Gaussian
parameters
• Conjugate priors
– Mean: Gaussian prior
– Variance: Wishart Distribution
Supervised Learning of Classifiers
Find f
• Given: Training set {(xi, yi) | i = 1 … n}
• Find: A good approximation to f : X → Y
Examples: what are X and Y ?
• Spam Detection
– Map email to {Spam,Ham}
• Digit recognition
  – Map pixels to {0,1,2,3,4,5,6,7,8,9}
• Stock Prediction
  – Map new, historic prices, etc. to ℝ (the real numbers)
Bayesian Categorization
• Let the set of categories be {c1, c2, …, cn}
• Let E be a description of an instance.
• Determine the category of E by computing, for each ci:
    P(ci | E) = P(ci) P(E | ci) / P(E)
• P(E) can be ignored since it is the same factor for all categories:
    P(ci | E) ∝ P(ci) P(E | ci)
Text classification
• Classify e-mails
– Y = {Spam,NotSpam}
• Classify news articles
– Y = {what is the topic of the article?}
• Classify webpages
– Y = {Student, professor, project, …}
• What to use for features, X?
Example: Spam Filter
• Input: email
• Output: spam/ham
• Setup:
– Get a large collection of example
emails, each labeled “spam” or
“ham”
– Note: someone has to hand label
all this data!
– Want to learn to predict labels of
new, future emails
• Features: The attributes used to
make the ham / spam decision
– Words: FREE!
– Text Patterns: $dd, CAPS
– For email specifically,
Semantic features: SenderInContacts
– …
Dear Sir.
First, I must solicit your confidence in this
transaction, this is by virture of its nature
as being utterly confidencial and top
secret. …
TO BE REMOVED FROM FUTURE
MAILINGS, SIMPLY REPLY TO THIS
MESSAGE AND PUT "REMOVE" IN THE
SUBJECT.
99 MILLION EMAIL ADDRESSES
FOR ONLY $99
Ok, Iknow this is blatantly OT but I'm
beginning to go insane. Had an old Dell
Dimension XPS sitting in the corner and
decided to put it to use, I know it was
working pre being stuck in the corner, but
when I plugged it in, hit the power nothing
happened.
Features: X is the word sequence in the document; Xi is the ith word in the article.
Features for Text Classification
• X is sequence of words in document
• X (and hence P(X|Y)) is huge!!!
– Articles are at least 1000 words: X = {X1,…,X1000}
– Xi represents the ith word in the document, i.e., the domain of Xi is the entire
  vocabulary, e.g., Webster’s Dictionary (or more): 10,000 words, etc.
• 10,000^1000 = 10^4000 possible values for X
• Atoms in the Universe: ~10^80
– We may have a problem…
Bag of Words Model
Typical additional assumption –
– Position in document doesn’t matter:
• P(Xi=xi|Y=y) = P(Xk=xi|Y=y)
(all positions have the same distribution)
– Ignore the order of words
When the lecture is over, remember to wake up the
person sitting next to you in the lecture room.
Bag of Words Model
Typical additional assumption –
– Position in document doesn’t matter:
• P(Xi=xi|Y=y) = P(Xk=xi|Y=y)
(all positions have the same distribution)
– Ignore the order of words
– Sounds really silly, but often works very well!
in is lecture lecture next over person remember room
sitting the the the to to up wake when you
– From now on:
• Xi = Boolean: “wordi is in document”
• X = X1 … Xn
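A minimal sketch (illustrative names and vocabulary) of the Boolean bag-of-words encoding just described: Xi = 1 iff the ith vocabulary word appears anywhere in the document, so order and position are discarded.

```python
def bow_features(document, vocabulary):
    """Map a raw document to a {word: 0/1} feature dictionary."""
    words = set(document.lower().split())
    return {w: int(w in words) for w in vocabulary}

vocab = ["lecture", "wake", "oil", "free"]
doc = "When the lecture is over, remember to wake up the person sitting next to you"
print(bow_features(doc, vocab))   # {'lecture': 1, 'wake': 1, 'oil': 0, 'free': 0}
```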
Bag of Words Approach
aardvark   0
about      2
all        2
Africa     1
apple      0
anxious    0
...
gas        1
...
oil        1
…
Zaire      0
Bayesian Categorization
P(yi | X) ∝ P(yi) P(X | yi)
• Need to know:
– Priors: P(yi)
– Conditionals: P(X | yi)
• P(yi) are easily estimated from data.
– If ni of the examples in D are in yi, then P(yi) = ni / |D|
• Conditionals:
– X = X1 … Xn
– Estimate P(X1 … Xn | yi)
• Too many possible instances to estimate!
– (exponential in n)
– Even with bag of words assumption!
Need to Simplify Somehow
• Too many probabilities
– P(x1 ∧ x2 ∧ x3 | yi)
    P(x1 ∧ x2 ∧ x3 | spam)
    P(x1 ∧ x2 ∧ ¬x3 | spam)
    P(x1 ∧ ¬x2 ∧ x3 | spam)
    ….
    P(¬x1 ∧ ¬x2 ∧ ¬x3 | ¬spam)
• Can we assume some are the same?
  – P(x1 ∧ x2) = P(x1) P(x2) ?
Conditional Independence
• X is conditionally independent of Y given Z, if the probability distribution for
  X is independent of the value of Y, given the value of Z
• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
• Equivalent to: P(X | Y, Z) = P(X | Z)
Naïve Bayes
• Naïve Bayes assumption:
  – Features are independent given class:
      P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)
  – More generally:
      P(X1, …, Xn | Y) = ∏i P(Xi | Y)
• How many parameters now?
• Suppose X is composed of n binary features
The Naïve Bayes Classifier
• Given:
  – Prior P(Y)
  – n conditionally independent features X given the class Y
  – For each Xi, we have likelihood P(Xi | Y)
• Decision rule:
    y* = hNB(x) = argmax_y P(y) ∏i P(xi | y)
[Graphical model: class node Y with children X1, X2, …, Xn]
MLE for the parameters of NB
• Given dataset, count occurrences
• MLE for discrete NB, simply:
  – Prior:       P(Y = y) = Count(Y = y) / N
  – Likelihood:  P(Xi = x | Y = y) = Count(Xi = x, Y = y) / Count(Y = y)
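A counting sketch of these MLE formulas for binary features (illustrative code, not the course's): the prior is a class frequency and each likelihood is a conditional frequency within that class.

```python
from collections import Counter, defaultdict

def nb_mle(X, y):
    """X: list of dicts {feature: 0/1}; y: parallel list of class labels."""
    n = len(y)
    prior = {c: cnt / n for c, cnt in Counter(y).items()}
    counts = defaultdict(Counter)          # counts[(class, feature)][value]
    for xi, yi in zip(X, y):
        for f, v in xi.items():
            counts[(yi, f)][v] += 1
    likelihood = {
        key: {v: c / sum(cnt.values()) for v, c in cnt.items()}
        for key, cnt in counts.items()
    }
    return prior, likelihood

X = [{"free": 1, "meeting": 0}, {"free": 0, "meeting": 1}, {"free": 1, "meeting": 0}]
y = ["spam", "ham", "spam"]
prior, lik = nb_mle(X, y)
print(prior)                       # {'spam': 0.666..., 'ham': 0.333...}
print(lik[("spam", "free")])       # {1: 1.0}  (both spam examples contain "free")
```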
Subtleties of NB Classifier 1
Violating the NB Assumption
• Usually, features are not conditionally independent:
    P(X1, …, Xn | Y) ≠ ∏i P(Xi | Y)
• Actual probabilities P(Y|X) often biased towards 0 or 1
• Nonetheless, NB is the single most used classifier out
there
– NB often performs well, even when assumption is violated
– [Domingos & Pazzani ’96] discuss some conditions for good
performance
Subtleties of NB classifier 2: Overfitting
For Binary Features: We already know the answer!
• MAP: use the most likely parameter, θ̂ = argmax_θ P(θ | D)
• Beta prior equivalent to extra observations for each feature
• As N → ∞, prior is “forgotten”
• But, for small sample size, prior is important!
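A small sketch of the pseudo-count view above (notation is illustrative, not the slide's): a Beta(a, b) prior acts like a − 1 extra heads and b − 1 extra tails, so the MAP estimate is pulled toward the prior for small N and approaches the MLE as N grows.

```python
def mle(heads, tails):
    return heads / (heads + tails)

def map_beta(heads, tails, a, b):
    # MAP for a Bernoulli parameter under a Beta(a, b) prior.
    return (heads + a - 1) / (heads + tails + a + b - 2)

print(mle(3, 1), map_beta(3, 1, a=3, b=3))          # 0.75 vs 0.625: prior matters
print(mle(300, 100), map_beta(300, 100, a=3, b=3))  # 0.75 vs ~0.748: prior forgotten
```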
That’s Great for Binomial
• Works for Spam / Ham
• What about multiple classes?
  – E.g., given a Wikipedia page, predicting its type
Multinomials: Laplace Smoothing
• Laplace’s estimate:
  – Pretend you saw every outcome k extra times:
      P_LAP,k(x) = (count(x) + k) / (N + k|X|)
  – What’s Laplace with k = 0?
  – k is the strength of the prior
  – Can derive this as a MAP estimate for a multinomial with Dirichlet priors
• Laplace for conditionals:
  – Smooth each conditional independently:
      P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|)
  (Slide example: observed coin flips H, H, T)
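A minimal sketch of add-k (Laplace) smoothing matching the formula above; the coin-flip data mirrors the H, H, T example.

```python
from collections import Counter

def laplace_estimate(observations, domain, k=1):
    """P_LAP,k(x) = (count(x) + k) / (N + k * |X|) for each x in the domain."""
    counts = Counter(observations)
    n = len(observations)
    return {x: (counts[x] + k) / (n + k * len(domain)) for x in domain}

flips = ["H", "H", "T"]
print(laplace_estimate(flips, domain=["H", "T"], k=0))    # MLE: {'H': 2/3, 'T': 1/3}
print(laplace_estimate(flips, domain=["H", "T"], k=1))    # {'H': 3/5, 'T': 2/5}
print(laplace_estimate(flips, domain=["H", "T"], k=100))  # ~ uniform: prior dominates
```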
Naïve Bayes for Text
• Modeled as generating a bag of words for a
document in a given category by repeatedly
sampling with replacement from a
vocabulary V = {w1, w2,…wm} based on the
probabilities P(wj | ci).
• Smooth probability estimates with Laplace
m-estimates assuming a uniform distribution
over all words (p = 1/|V|) and m = |V|
– Equivalent to a virtual sample of seeing each word in
each category exactly once.
Easy to Implement
• But…
• If you do… it probably won’t work…
Probabilities: Important Detail!
• P(spam | X1 … Xn) = ∏i P(spam | Xi)
• Any more potential problems here?
  – We are multiplying lots of small numbers: danger of underflow!
      0.5^57 ≈ 7 × 10^-18
  – Solution? Use logs and add!
      p1 · p2 = e^(log(p1) + log(p2))
  – Always keep in log form
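A tiny sketch of the log-space trick: sum log-probabilities instead of multiplying probabilities.

```python
import math

probs = [0.5] * 57
product = 1.0
for p in probs:
    product *= p
print(product)                      # ~7e-18; thousands of factors would hit 0.0

log_score = sum(math.log(p) for p in probs)
print(log_score)                    # -39.5...; safe to compare across classes
print(math.exp(log_score))          # recover the product only if it is needed
```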
Naïve Bayes Posterior Probabilities
• Classification results of naïve Bayes
– I.e. the class with maximum posterior probability…
– Usually fairly accurate (?!?!?)
• However, due to the inadequacy of the
conditional independence assumption…
– Actual posterior-probability estimates not accurate.
– Output probabilities generally very close to 0 or 1.
NB with Bag of Words for text classification
• Learning phase:
– Prior P(Y)
• Count how many documents from each topic (prior)
– P(Xi|Y)
• For each topic, count how many times each word appears in documents of that
  topic (+ prior); remember this dist’n is shared across all positions i
• Test phase:
– For each document
• Use naïve Bayes decision rule
Twenty News Groups results
Learning curve for Twenty News Groups
What if we have continuous Xi ?
E.g., character recognition: Xi is the ith pixel
Gaussian Naïve Bayes (GNB):
    P(Xi = x | Y = yk) = 1/(σik√(2π)) · exp(−(x − μik)² / (2σik²))
Sometimes assume variance
• is independent of Y (i.e., σi),
• or independent of Xi (i.e., σk),
• or both (i.e., σ)
Estimating Parameters: Y discrete, Xi continuous
Maximum likelihood estimates:
• Mean:      μik = Σj Xi^j · δ(Y^j = yk) / Σj δ(Y^j = yk)
• Variance:  σ²ik = Σj (Xi^j − μik)² · δ(Y^j = yk) / Σj δ(Y^j = yk)
where j indexes the jth training example and δ(x) = 1 if x is true, else 0
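A minimal sketch (illustrative data) of these per-class estimates: for each class yk and feature Xi, compute the mean and variance over only the examples labeled yk.

```python
import numpy as np

def gnb_fit(X, y):
    """X: (N, d) array of continuous features; y: length-N array of labels."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]                                   # examples with label k
        params[k] = (Xk.mean(axis=0), Xk.var(axis=0))    # mu_ik, sigma^2_ik (1/N form)
    return params

X = np.array([[1.0, 10.0], [1.2, 11.0], [3.0, 2.0], [3.4, 1.0]])
y = np.array(["a", "a", "b", "b"])
for k, (mu, var) in gnb_fit(X, y).items():
    print(k, mu, var)
```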
Example: GNB for classifying mental states
[Mitchell et al.]
• ~1 mm resolution
• ~2 images per sec.
• 15,000 voxels/image
• non-invasive, safe
• measures Blood Oxygen Level Dependent (BOLD) response
[Figure: typical BOLD impulse response, ~10 sec]
• Brain scans can track activation with precision and sensitivity
[Mitchell et al.]
Gaussian Naïve Bayes: Learned μ_voxel,word
P(BrainActivity | WordCategory = {People,Animal})
[Mitchell et al.]
Learned Bayes Models – Means for
P(BrainActivity | WordCategory)
[Mitchell et al.]
Pairwise classification accuracy: 85%
[Figure: learned activation means for People words vs. Animal words]
Bayes Classifier is Optimal!
• Learn: h : X → Y
  – X – features
  – Y – target classes
• Suppose: you know the true P(Y|X):
  – Bayes classifier:
      hBayes(x) = argmax_y P(Y = y | X = x)
• Why?
Optimal classification
• Theorem:
  – Bayes classifier hBayes is optimal!
    That is, error_true(hBayes) ≤ error_true(h) for any other classifier h
  – Why?
What you need to know about Naïve
Bayes
Naïve Bayes classifier
– What’s the assumption
– Why we use it
– How do we learn it
– Why is Bayesian estimation important
• Text classification
– Bag of words model
• Gaussian NB
– Features are still conditionally independent
– Each feature has a Gaussian distribution given class
• Optimal decision using Bayes Classifier
Text Naïve Bayes Algorithm
(Train)
Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)
Text Naïve Bayes Algorithm
(Test)
Given a test document X
Let n be the number of word occurrences in X
Return the category:
    argmax_{ci ∈ C}  P(ci) · ∏_{i=1..n} P(ai | ci)
where ai is the word occurring in the ith position in X
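A minimal Python sketch of the train/test pseudocode above. It assumes documents arrive already tokenized into word lists, uses the same Laplace-smoothed estimates, and keeps scores in log form as recommended earlier.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """docs: list of word lists; labels: parallel list of categories."""
    vocab = {w for doc in docs for w in doc}                  # V
    categories = set(labels)
    prior, cond = {}, {}
    for c in categories:
        docs_c = [d for d, l in zip(docs, labels) if l == c]  # D_i
        prior[c] = len(docs_c) / len(docs)                    # P(c_i) = |D_i| / |D|
        counts = Counter(w for d in docs_c for w in d)        # n_ij
        n_c = sum(counts.values())                            # n_i
        cond[c] = {w: (counts[w] + 1) / (n_c + len(vocab))    # Laplace smoothing
                   for w in vocab}
    return prior, cond, vocab

def classify_nb(doc, prior, cond, vocab):
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in doc:
            if w in vocab:                                    # ignore unseen words
                score += math.log(cond[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

docs = [["cheap", "meds", "free"], ["meeting", "tomorrow"], ["free", "offer"]]
labels = ["spam", "ham", "spam"]
model = train_nb(docs, labels)
print(classify_nb(["free", "meds"], *model))                  # -> 'spam'
```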
Naïve Bayes Time Complexity
• Training Time: O(|D| Ld + |C||V|)
where Ld is the average length of a document in D.
– Assumes V and all Di , ni, and nij pre-computed in
O(|D|Ld) time during one pass through all of the data.
– Generally just O(|D|Ld) since usually |C||V| < |D|Ld
• Test Time: O(|C| Lt)
  where Lt is the average length of a test document.
• Very efficient overall, linearly proportional to the
time needed to just read in all the data.
Multi-Class Categorization
• Pick the category with max probability
• Create many 1 vs other classifiers
• Use a hierarchical approach (wherever hierarchy
available)
Entity
  Person
    Scientist
    Artist
  Location
    City
    County
    Country