SIMS 290-2: Applied Natural Language Processing: Marti Hearst

Probability Theory
Bayes Theorem and Naïve Bayes
classification
1
Definition of Probability
Probability theory encodes our knowledge
or belief about the collective likelihood of
the outcome of an event.
We use probability theory to try to predict
which outcome will occur for a given
event.
Slide adapted from Dan Jurafsky's
2
Sample Spaces
We think about the “sample space” as being the set of all
possible outcomes:
For tossing a coin, the possible outcomes are Heads or Tails.
For competing in the Olympics, the set of outcomes for a
given contest are {gold, silver, bronze, no_award}.
For computing part-of-speech, the set of outcomes are {JJ,
DT, NN, RB, … etc.}
We use probability theory to try to predict which outcome
will occur for a given event.
Slide adapted from Dan Jurafsky's
3
Axioms of Probability Theory
Probabilities are real numbers between 0 and 1 representing
the a priori likelihood that a proposition is true.
Necessarily true propositions have probability 1,
necessarily false have probability 0.
P(true) = 1
P(false) = 0
Slides adapted from Mary Ellen Califf
4
Probability Axioms
A probability law must satisfy certain properties:
Nonnegativity
P(A) >= 0, for every event A
Additivity
If A and B are two disjoint events, then the probability of
their union satisfies:
P(A U B) = P(A) + P(B)
Normalization
The probability of the entire sample space S is equal to 1,
i.e., P(S) = 1.
P(Cold) = 0.1 (the probability that it will be cold)
P(¬Cold) = 0.9 (the probability that it will be not cold)
Slide adapted from Dan Jurafsky's
5
An example
An experiment involving a single coin toss
There are two possible outcomes, Heads and Tails
Sample space S is {H,T}
If coin is fair, should assign equal probabilities to 2
outcomes
Since they have to sum to 1
P({H}) = 0.5
P({T}) = 0.5
P({H,T}) = P({H})+P({T}) = 1.0
Slide adapted from Dan Jurafsky's
6
Another example
Experiment involving 3 coin tosses
Outcome is a 3-long string of H or T
S ={HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}
Assume each outcome is equally likely (equiprobable)
“Uniform distribution”
What is probability of the event A that exactly 2
heads occur?
A = {HHT,HTH,THH}
P(A) = P({HHT})+P({HTH})+P({THH})
= 1/8 + 1/8 + 1/8
=3/8
Slide adapted from Dan Jurafsky's
7
Probability definitions
In summary:
P(E) = (number of outcomes corresponding to event E) / (total number of outcomes)
Probability of drawing a spade from 52 well-shuffled playing cards:
13/52 = ¼ = 0.25
Slide adapted from Dan Jurafsky's
8
How about non-uniform
probabilities? An example:
A biased coin,
twice as likely to come up tails as heads,
is tossed twice
What is the probability that at least one head occurs?
Sample space = {hh, ht, th, tt} (h = heads, t = tails)
Sample points/probability for the event:
(each head has probability 1/3; each tail has prob 2/3):
ht 1/3 x 2/3 = 2/9
hh 1/3 x 1/3= 1/9
th 2/3 x 1/3 = 2/9
tt 2/3 x 2/3 = 4/9
Answer: 5/9 = 0.56
(sum of weights in red since want outcome with at least
one heads)
By contrast, probability of event for the unbiased coin = 0.75
Slide adapted from Dan Jurafsky's
9
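As a sanity check, the same answer can be obtained by brute-force enumeration of the sample space. A minimal Python sketch (the variable names are illustrative, not from the slides):

from itertools import product

# Biased coin: P(heads) = 1/3, P(tails) = 2/3
p = {'h': 1/3, 't': 2/3}

# Enumerate the sample space for two tosses and add up the weights of
# every outcome that contains at least one head.
prob = sum(p[a] * p[b] for a, b in product('ht', repeat=2) if 'h' in (a, b))
print(prob)   # 0.555... = 5/9
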
Moving toward language
What’s the probability of drawing a 2 from a deck of
52 cards?
P(drawing a two) = 4/52 = 1/13 ≈ .077
What’s the probability of a random word (from a
random dictionary page) being a verb?
P(drawing a verb) = (# of ways to get a verb) / (all words)
Slide adapted from Dan Jurafsky's
10
Probability and part of speech tags
• What’s the probability of a random word (from a random
dictionary page) being a verb?
P(drawing a verb) = (# of ways to get a verb) / (all words)
• How to compute each of these?
• All words = just count all the words in the dictionary
• # of ways to get a verb: number of words which are verbs!
• If a dictionary has 50,000 entries, and 10,000 are verbs….
P(V) is 10000/50000 = 1/5 = .20
Slide adapted from Dan Jurafsky's
11
Independence
Most things in life depend on one another, that is, they are
dependent:
If I drive to SF, this may affect your attempt to go there.
It may also affect the environment, the economy of SF, etc.
If 2 events are independent, this means that the occurrence of one
event makes it neither more nor less probable that the other occurs.
If I flip my coin, it won’t have an effect on the outcome of your
coin flip.
P(A,B)= P(A) · P(B) if and only if A and B are independent.
P(heads,tails) = P(heads) · P(tails) = .5 · .5 = .25
Note: P(A|B)=P(A) iff A,B independent
P(B|A)=P(B) iff A,B independent
Slide adapted from Dan Jurafsky's
12
Conditional Probability
A way to reason about the outcome of an experiment
based on partial information
How likely is it that a person has a disease given that
a medical test was negative?
In a word guessing game the first letter for the word
is a “t”. What is the likelihood that the second letter
is an “h”?
A spot shows up on a radar screen. How likely is it
that it corresponds to an aircraft?
Slide adapted from Dan Jurafsky's
13
Conditional Probability
Conditional probability specifies the probability given
that the values of some other random variables are
known.
P(Sneeze | Cold) = 0.8
P(Cold | Sneeze) = 0.6
The probability of a sneeze given a cold is 80%.
The probability of a cold given a sneeze is 60%.
Slides adapted from Mary Ellen Califf
14
More precisely
Given an experiment, a corresponding sample space S, and the
probability law
Suppose we know that the outcome is within some given event B
The first letter was ‘t’
We want to quantify the likelihood that the outcome also belongs
to some other given event A.
The second letter will be ‘h’
We need a new probability law that gives us the conditional
probability of A given B
P(A|B) “the probability of A given B”
Slide adapted from Dan Jurafsky's
15
Joint Probability Distribution
The joint probability distribution for a set of random variables X1…Xn
gives the probability of every combination of values
P(X1,...,Xn)
             Cold    ¬Cold
Sneeze       0.08    0.01
¬Sneeze      0.01    0.9
The probability of all possible cases can be calculated by summing
the appropriate subset of values from the joint distribution.
All conditional probabilities can therefore also be calculated
P(Cold | ¬Sneeze)
BUT it’s often very hard to obtain all the probabilities for a joint
distribution
Slides adapted from Mary Ellen Califf
16
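As a small illustration of “summing the appropriate subset of values,” here is a hedged sketch that recovers P(Cold | ¬Sneeze) from the joint table above (the dictionary layout is just one possible encoding):

# Joint distribution P(Sneeze, Cold) from the table above.
joint = {
    ('sneeze', 'cold'): 0.08,
    ('sneeze', 'no_cold'): 0.01,
    ('no_sneeze', 'cold'): 0.01,
    ('no_sneeze', 'no_cold'): 0.90,
}

# P(Cold | no Sneeze) = P(no Sneeze, Cold) / P(no Sneeze)
p_no_sneeze = sum(v for (s, c), v in joint.items() if s == 'no_sneeze')
print(joint[('no_sneeze', 'cold')] / p_no_sneeze)   # 0.01 / 0.91 ≈ 0.011
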
Conditional Probability Example
• Let’s say A is “it’s raining”.
• Let’s say P(A) in dry California is .01
• Let’s say B is “it was sunny ten minutes ago”
• P(A|B) means “what is the probability of it raining now if it was sunny 10
minutes ago”
• P(A|B) is probably way less than P(A)
• Perhaps P(A|B) is .0001
• Intuition: The knowledge about B should change our estimate of
the probability of A.
Slide adapted from Dan Jurafsky's
17
Conditional Probability
Let A and B be events
P(A,B) and P(A ∩ B) both mean “the probability that
BOTH A and B occur”
P(B|A) = the probability of event B occurring given
that event A occurs
Definition: P(A|B) = P(A ∩ B) / P(B)
P(A, B) = P(A|B) * P(B)   (simple arithmetic)
P(A, B) = P(B, A)         (AND is symmetric)
Slide adapted from Dan Jurafsky's
18
Bayes Theorem
We start with conditional probability definition:
P(A|B) = P(A, B) / P(B)
So say we know how to compute P(A|B). What if we
want to figure out P(B|A)? We can re-arrange the
formula using Bayes Theorem:
P(B|A) = P(A|B) P(B) / P(A)
19
Deriving Bayes Rule
P(A  B)
P(A | B) 
P(A  B)
P(B | A) 
P(B)
P(A)
P(A | B)P(B)  P(A  B) P(B | A)P(A)  P(A  B)
 | B)P(B)  P(B | A)P(A)
P(A


P(B | A)P(A)
P(A | B) 
P(B)
Slide adapted from Dan Jurafsky's
20
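To make the rearrangement concrete, a tiny worked example in Python with made-up numbers (none of these values come from the slides):

# Hypothetical inputs, purely for illustration.
p_a_given_b = 0.8   # P(A | B)
p_b = 0.1           # P(B)
p_a = 0.25          # P(A)

# Bayes Theorem: P(B | A) = P(A | B) * P(B) / P(A)
print(p_a_given_b * p_b / p_a)   # 0.32
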
How to compute probabilities?
We don’t have the probabilities for most NLP
problems
We can try to estimate them from data
(that’s the learning part)
Usually we can’t actually estimate the probability that
something belongs to a given class given the
information about it
BUT we can estimate the probability that something
in a given class has particular values.
Slides adapted from Mary Ellen Califf
21
Simple Bayesian Reasoning
If we assume there are n possible disjoint rhetorical zones,
z1 … zn
P(zi | w) = P(w | zi) P(zi) / P(w)
Want to know the probability of the zone given the word.
Can count how often we see the word in a sentence from
a given zone in the training set, and how often the zone
itself occurs.
P(w| zi ) = number of times we see this zone with this word
divided by how often we see the zone
This is the learning part.
Since P(w) is always the same, we can ignore it.
So now only need to compute P(w | zi) P(zi)
Slides adapted from Mary Ellen Califf
22
Bayes Independence Example
Imagine there are diagnoses ALLERGY, COLD, and WELL and
symptoms SNEEZE, COUGH, and FEVER
Prob           Well    Cold    Allergy
P(d)           0.9     0.05    0.05
P(sneeze|d)    0.1     0.9     0.9
P(cough|d)     0.1     0.8     0.7
P(fever|d)     0.01    0.7     0.4
Slides adapted from Mary Ellen Califf
23
Bayes Independence Example
By assuming independence, we ignore the possible interactions.
If symptoms are: sneeze & cough & no fever:
P(well | e) = (0.9)(0.1)(0.1)(0.99)/P(e) = 0.0089/P(e)
P(cold | e) = (.05)(0.9)(0.8)(0.3)/P(e) = 0.01/P(e)
P(allergy | e) = (.05)(0.9)(0.7)(0.6)/P(e) = 0.019/P(e)
P(e) = .0089 + .01 + .019 = .0379
P(well | e) = .23
P(cold | e) = .26
P(allergy | e) = .50
Diagnosis: allergy
Slides adapted from Mary Ellen Califf
24
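The arithmetic on this slide is easy to reproduce; a minimal sketch with the table values hard-coded follows. (The posteriors come out as roughly .23, .28, and .49 rather than .23, .26, .50 because the slide rounds the intermediate products.)

# P(d) and P(symptom | d) for each diagnosis, from the table on the previous slide.
priors   = {'well': 0.9,  'cold': 0.05, 'allergy': 0.05}
p_sneeze = {'well': 0.1,  'cold': 0.9,  'allergy': 0.9}
p_cough  = {'well': 0.1,  'cold': 0.8,  'allergy': 0.7}
p_fever  = {'well': 0.01, 'cold': 0.7,  'allergy': 0.4}

# Evidence e: sneeze & cough & no fever (naive independence assumed).
scores = {d: priors[d] * p_sneeze[d] * p_cough[d] * (1 - p_fever[d])
          for d in priors}
p_e = sum(scores.values())                     # P(e), the normalizer
posterior = {d: s / p_e for d, s in scores.items()}
print(posterior)
print(max(posterior, key=posterior.get))       # 'allergy'
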
Kupiec et al. Feature Representation
Fixed-phrase feature
Certain phrases indicate summary, e.g. “in summary”
Paragraph feature
Paragraph initial/final more likely to be important.
Thematic word feature
Repetition is an indicator of importance
Uppercase word feature
Uppercase often indicates named entities. (Taylor)
Sentence length cut-off
Summary sentence should be > 5 words.
25
Training
Hand-label sentences in training set (good/bad
summary sentences)
Train classifier to distinguish good/bad summary
sentences
Model used: Naïve Bayes
Can rank sentences according to score and show top
n to user.
26
Naïve Bayes Classifier
The simpler version of Bayes was:
P(B|A) = P(A|B)P(B)/P(A)
P(Sentence | feature) = P(feature | S) P(S)/P(feature)
Using Naïve Bayes, we expand to many features by assuming the features are
independent given the class, which defines the joint probability distribution:
P(Sentence, f1, f2, … fn) = P(Sentence) ∏ P(fi | Sentence), and P(f1, f2, … fn) ≈ ∏ P(fi)
We learn P(Sentence) and P(fi| Sentence) in training
Test: we need to state P(Sentence | f1, f2, … fn)
P(Sentence| f1, f2, … fn) =
P(Sentence, f1, f2, … fn) / P(f1, f2, … fn)
27
Details: Bayesian Classifier
P(s ∈ S | F1, F2, … Fk) = P(F1, F2, … Fk | s ∈ S) P(s ∈ S) / P(F1, F2, … Fk)

Assuming statistical independence:

P(s ∈ S | F1, F2, … Fk) = [ ∏ j=1..k P(Fj | s ∈ S) ] · P(s ∈ S) / ∏ j=1..k P(Fj)

where:
P(s ∈ S | F1, … Fk): probability that sentence s is included in summary S,
given that sentence’s feature-value pairs
P(s ∈ S): the compression rate
P(Fj | s ∈ S): probability of a feature-value pair occurring in a source
sentence which is also in the summary
P(Fj): probability of the feature-value pair occurring in a source sentence
28
Bayesian Classifier (Kupiec et al. 95)
Each Probability is calculated empirically from a
corpus
See how often each feature is seen with a sentence
selected for a summary, vs how often that feature is
seen in any sentence.
Higher probability sentences are chosen to be in the
summary
Performance:
For 25% summaries, 84% precision
From lecture notes by Nachum Dershowitz & Dan Cohen
29
How to compute this?
For training, for each feature f:
For each sentence s that has feature f:
– If the sentence is in the target summary S, increment T
– Increment F no matter which sentence it is in.
P(f|S) = T / (number of sentences in the summary)
P(f) = F / N, where N is the total number of sentences
For testing, for each document:
For each sentence
– Multiply the probabilities of all of the features occurring in the
sentence times the probability of any sentence being in the
summary (a constant). Divide by the probability of the
features occurring in any sentence.
30
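A minimal sketch of this counting in Python, assuming each sentence is represented as a (set-of-features, in-summary) pair; the data layout and function names are invented for illustration, not the authors' code:

from collections import defaultdict

def train(sentences):
    # sentences: list of (features, in_summary) pairs, where features is a
    # set of feature values and in_summary is True/False.
    n = len(sentences)
    n_summary = sum(1 for _, in_s in sentences if in_s)
    t = defaultdict(int)   # how often a feature occurs in a summary sentence
    f = defaultdict(int)   # how often a feature occurs in any sentence
    for features, in_summary in sentences:
        for feat in features:
            f[feat] += 1
            if in_summary:
                t[feat] += 1
    p_s = n_summary / n                                    # the compression rate
    p_f_given_s = {feat: t[feat] / n_summary for feat in f}
    p_f = {feat: f[feat] / n for feat in f}
    return p_s, p_f_given_s, p_f

def score(features, p_s, p_f_given_s, p_f):
    # Unnormalized Naive Bayes score for one test sentence.
    s = p_s
    for feat in features:
        s *= p_f_given_s.get(feat, 0.0) / p_f.get(feat, 1.0)
    return s
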
Learning Sentence Extraction Rules
(Kupiec et al. 95)
Results
About 87% (498) of all summary sentences (568) could be
directly matched to sentences in the source (79% direct
matches, 3% direct joins, 5% incomplete joins)
Location was best feature at 163/498 = 33%
Para+fixed-phrase+sentence length cut-off gave best
sentence recall performance … 217/498=44%
At compression rate = 25% (20 sentences), performance
peaked at 84% sentence recall
J. Kupiec, J. Pedersen, and F. Chen. "A Trainable Document Summarizer", Proceedings of the 18th ACM-SIGIR Conference, pages 68--73, 1995.
31
Language Identification
32
Language identification
Tutti gli esseri umani nascono liberi ed eguali
in dignità e diritti. Essi sono dotati di
ragione e di coscienza e devono agire gli uni
verso gli altri in spirito di fratellanza.
Alle Menschen sind frei und gleich an Würde und
Rechten geboren. Sie sind mit Vernunft und
Gewissen begabt und sollen einander im Geist
der Brüderlichkeit begegnen.
Universal Declaration of Human Rights, UN, in 363 languages
http://www.unhchr.ch/udhr/navigate/alpha.htm
33
Language identification
égaux
eguali
iguales
edistämään
Ü
¿
How to determine, for a stretch of text, which
language it is from?
34
Language Identification
Turns out to be really simple
Just a few character bigrams can do it
(Sibun & Reynar 96)
Based on language models for sample languages
35
Language model basics
Unigram and bigram models
Evaluating N-gram language models
Perplexity
The intuition of perplexity is that given two
probabilistic models, the better model is the one that
better predicts new data (data not used to train the
model)
We can measure better prediction by comparing the
probability the model assigns to the test data
The better model will assign a higher probability to the test data
36
Perplexity
PP(W) = P(w1 w2 … wN)^(-1/N) = the N-th root of 1 / P(w1 w2 … wN)
Minimizing perplexity is equivalent to maximizing the
test set probability according to the language model
37
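As a sketch, perplexity can be computed from the per-word probabilities the model assigns to the test data; working in log space avoids numerical underflow. (The language model itself is assumed to exist elsewhere.)

import math

def perplexity(word_probs):
    # word_probs: the model's probability for each of the N test words.
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)   # log P(w1 ... wN)
    return math.exp(-log_prob / n)                    # P(w1 ... wN)^(-1/N)

# A model that assigns probability 0.1 to every word has perplexity 10.
print(perplexity([0.1] * 5))   # 10.0
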
Entropy
X: a random variable with probability distribution p(x)
H(X) = − Σx p(x) log2 p(x)
Measures how predictive a given N-gram model is
about what the next word could be
38
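A direct transcription of the formula, as a sketch:

import math

def entropy(dist):
    # H(X) = -sum_x p(x) * log2 p(x), with the distribution given as {x: p(x)}.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

print(entropy({'h': 0.5, 't': 0.5}))     # 1.0 bit for a fair coin
print(entropy({'h': 1/3, 't': 2/3}))     # ≈ 0.918 bits for the biased coin
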
KL divergence (relative entropy)
D(P || Q) = Σx P(x) log ( P(x) / Q(x) )
Basis of comparing two probability distributions
39
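A sketch of the formula; note that it assumes Q(x) > 0 wherever P(x) > 0 (in practice smoothing takes care of the zeros):

import math

def kl_divergence(p, q):
    # D(P || Q) = sum_x P(x) * log2( P(x) / Q(x) ), with P and Q as {x: prob}.
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {'a': 0.5, 'b': 0.5}
q = {'a': 0.9, 'b': 0.1}
print(kl_divergence(p, q))   # ≈ 0.74 bits; the divergence is 0 only when P equals Q
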
Language Identification
Turns out to be really simple
Just a few character bigrams can do it
(Sibun & Reynar 96)
Used Kullback Leibler distance (relative entropy)
Compare probability distribution of the test set to
those for the languages trained on
Smallest distance determines the language
Using special character sets helps a bit, but barely
40
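A rough sketch of the procedure just described: build a character-bigram distribution for each training language, then choose the language whose distribution is closest (smallest relative entropy) to that of the test text. The smoothing floor and the function names are illustrative assumptions, not details from Sibun & Reynar:

import math
from collections import Counter

def bigram_probs(text):
    # Relative frequencies of character bigrams in a text.
    counts = Counter(zip(text, text[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def kl_to_model(test_probs, model_probs, floor=1e-6):
    # KL(test || model); unseen bigrams in the model get a tiny floor
    # probability so the log is defined (a crude stand-in for real smoothing).
    return sum(p * math.log2(p / model_probs.get(bg, floor))
               for bg, p in test_probs.items())

def identify_language(test_text, training_texts):
    # training_texts: {language_name: sample_text}; returns the closest language.
    test_probs = bigram_probs(test_text)
    models = {lang: bigram_probs(txt) for lang, txt in training_texts.items()}
    return min(models, key=lambda lang: kl_to_model(test_probs, models[lang]))

training = {
    'english': "all human beings are born free and equal in dignity and rights",
    'italian': "tutti gli esseri umani nascono liberi ed eguali in dignità e diritti",
}
print(identify_language("esseri umani liberi", training))   # expected: 'italian'
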
Language Identification
(Sibun & Reynar 96)
41
Confusion Matrix
A table that shows, for each class, which ones your
algorithm got right and which wrong
(Table axes: the gold standard vs. the algorithm’s guess)
42
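A minimal sketch of building such a table from parallel lists of gold labels and guesses (the function name is illustrative):

from collections import Counter

def confusion_matrix(gold, guesses):
    # Counts of (gold label, guessed label) pairs.
    return Counter(zip(gold, guesses))

gold    = ['english', 'french', 'french', 'italian', 'english']
guesses = ['english', 'french', 'italian', 'italian', 'english']
for (g, p), n in confusion_matrix(gold, guesses).items():
    print(f"gold={g:8s} guess={p:8s} count={n}")
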
43
Author Identification
(Stylometry)
44
Author Identification
Also called Stylometry in the humanities
An example of a Classification Problem
Classifiers:
Decide which of N buckets to put an item in
(Some classifiers allow for multiple buckets)
45
The Disputed Federalist Papers
In 1787-1788, Jay, Madison, and Hamilton
wrote a series of anonymous essays to
convince the voters of New York to ratify the
new U. S. Constitution.
There is scholarly consensus that:
5 authored by Jay
51 authored by Hamilton
14 authored by Madison
3 jointly by Hamilton and Madison
12 remain in dispute … Hamilton or Madison?
46
Author identification
Federalist papers
In 1963 Mosteller and Wallace solved the problem
They identified function words as good candidates for
authorship analysis
Using statistical inference they concluded the author
was Madison
Since then, other statistical techniques have
supported this conclusion.
47
Function vs. Content Words
High rates for “by” favor M, low favor H
High rates for “from” favor M, low says little
High rates for “to” favor H, low favor M
48
Function vs. Content Words
No consistent pattern for “war”
49
Federalist Papers Problem
Fung, The Disputed Federalist Papers: SVM Feature Selection
Via Concave Minimization, ACM TAPIA’03
50
Classification
Goal: Assign ‘objects’ from a universe to two or more classes or
categories
Examples:
Problem                    Object      Categories
Tagging                    Word        POS
Sense Disambiguation       Word        The word’s senses
Information retrieval      Document    Relevant/not relevant
Sentiment classification   Document    Positive/negative
Author identification      Document    Authors

From: Foundations of Statistical Natural Language
Processing. Manning and Schutze
51
Text Categorization Applications
Web pages organized into category hierarchies
Journal articles indexed by subject categories (e.g., the Library
of Congress, MEDLINE, etc.)
Responses to Census Bureau occupations
Patents archived using International Patent Classification
Patient records coded using international insurance categories
E-mail message filtering
News events tracked and filtered by topics
Spam
Slide adapted from Paul Bennett
52
Yahoo News Categories
53
Cost of Manual Text Categorization
Yahoo!
 200 (?) people for manual labeling of Web pages
 using a hierarchy of 500,000 categories
MEDLINE (National Library of Medicine)
 $2 million/year for manual indexing of journal articles
 using MEdical Subject Headings (18,000 categories)
Mayo Clinic
 $1.4 million annually for coding patient-record events
 using the International Classification of Diseases (ICD) for
billing insurance companies
US Census Bureau decennial census (1990: 22 million
responses)
 232 industry categories and 504 occupation categories
 $15 million if fully done by hand
Slide adapted froml Paul Bennett
54
Knowledge Engineering vs. Statistical Learning
For US Census Bureau Decennial Census 1990
232 industry categories and 504 occupation categories
$15 million if fully done by hand
Define classification rules manually:
Expert System AIOCS
Development time: 192 person-months (2 people, 8 years)
Accuracy = 47%
Learn classification function
Nearest Neighbor classification (Creecy ’92: 1-NN)
Development time: 4 person-months (Thinking Machine)
Accuracy = 60%
Slide adapted from Paul Bennett
55
Text Topic categorization
Topic categorization: classify the document into
semantic topics
The U.S. swept into the Davis
Cup final on Saturday when twins
Bob and Mike Bryan defeated
Belarus's Max Mirnyi and Vladimir
Voltchkov to give the Americans
an unsurmountable 3-0 lead in the
best-of-five semi-final tie.
One of the strangest, most
relentless hurricane seasons on
record reached new bizarre heights
yesterday as the plodding approach
of Hurricane Jeanne prompted
evacuation orders for hundreds of
thousands of Floridians and high
wind warnings that stretched 350
miles from the swamp towns south
of Miami to the historic city of St.
Augustine.
56
The Reuters collection
A gold standard
Collection of (21,578) newswire documents.
For research purposes: a standard text collection to compare
systems and algorithms
135 valid topic categories
57
Reuters
Top topics in Reuters
58
Reuters Document Example
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off
tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states
determining industry positions on a number of issues, according to the National Pork Producers
Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues,
including the future direction of farm policy and the tax law as it applies to the agriculture sector.
The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus)
control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all
areas of the industry, the NPPC added. Reuter
&#3;</BODY></TEXT></REUTERS>
59
Classification vs. Clustering
Classification assumes labeled data: we know how
many classes there are and we have examples for
each class (labeled data).
Classification is supervised
In Clustering we don’t have labeled data; we just
assume that there is a natural division in the data
and we may not know how many divisions (clusters)
there are
Clustering is unsupervised
60
Categories (Labels, Classes)
Labeling data
2 problems:
Decide the possible classes (which ones, how many)
Domain and application dependent
Label text
Difficult, time consuming, inconsistency between
annotators
61
Reuters Example, revisited
Why not topic = policy ?
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off
tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states
determining industry positions on a number of issues, according to the National Pork Producers
Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues,
including the future direction of farm policy and the tax law as it applies to the agriculture sector.
The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus)
control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all
areas of the industry, the NPPC added. Reuter
&#3;</BODY></TEXT></REUTERS>
62
Binary vs. multi-way classification
Binary classification: two classes
Multi-way classification: more than two classes
Sometimes it can be convenient to treat a multi-way
problem like a binary one: one class versus all the
others, for all classes
63
Features
>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)
Here the classification takes as input the
whole string
What’s the problem with that?
What are the features that could be useful for
this example?
64
Feature terminology
Feature: An aspect of the text that is relevant to the
task
Some typical features
Words present in text
Frequency of words
Capitalization
Are there named entities (NEs)?
WordNet
Others?
65
Feature terminology
Feature: An aspect of the text that is relevant to the
task
Feature value: the realization of the feature in the text
Words present in text : Kerry, Schumacher, China…
Frequency of word: Kerry(10), Schumacher(1)…
Are there dates? Yes/no
Are there PERSONS? Yes/no
Are there ORGANIZATIONS? Yes/no
WordNet: Holonyms (China is part of Asia),
Synonyms(China, People's Republic of China, mainland China)
66
Feature Types
Boolean (or Binary) Features
Features that generate boolean (binary) values.
Boolean features are the simplest and the most
common type of feature.
f1(text) = 1 if text contains “Kerry”, 0 otherwise
f2(text) = 1 if text contains a PERSON, 0 otherwise
67
Feature Types
Integer Features
Features that generate integer values.
Integer features can be used to give classifiers access to
more precise information about the text.
f1(text) = Number of times text contains “Kerry”
f2(text) = Number of times text contains PERSON
68
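Both feature types can be written as plain Python functions over the text; the capitalization count below is only a crude stand-in for a real PERSON detector (an actual system would run an NER tagger):

def contains_kerry(text):
    # Boolean feature: 1 if the text mentions "Kerry", else 0.
    return 1 if "Kerry" in text else 0

def count_kerry(text):
    # Integer feature: how many times "Kerry" appears in the text.
    return text.count("Kerry")

def count_capitalized_tokens(text):
    # Crude stand-in for a PERSON / named-entity count: the number of
    # capitalized tokens (a real system would use an NER tagger).
    return sum(1 for tok in text.split() if tok[:1].isupper())

text = "Kerry met Schumacher in Shanghai before the Grand Prix."
print(contains_kerry(text), count_kerry(text), count_capitalized_tokens(text))   # 1 1 5
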
Feature selection
How do we choose the “right” features?
Next lecture
69