Part-of-speech tagging - School of Computer Science
Part of Speech Tagging
John Barnden
School of Computer Science
University of Birmingham
Natural Language Processing 1
2014/15 Semester 2
Overview
• What the task is
• Brief: the range of approaches
• Some needed stuff on probability
• The core stochastic (probabilistic) approach
What the Task Is
• Assign a part of speech (POS) or similar classification to each
lexical form (and possibly some multi-form sequences treated as
units) in an utterance ...
– and to punctuation marks and sometimes to affixes/clitics (e.g., ’s):
such things are separated off beforehand by a prior “tokenization” phase.
• Usually described as being applied to text or transcribed speech, but
could be to speech directly, in principle.
– We will assume just ordinary text.
– Transcribed speech: might need to cope with things like pause indicators
• We will look just at the usual case of choosing one best-looking
classification per word, but variant methods can assign more than
one (e.g. the most probable few).
The “Output” – Text/Tags
• The classification is usually recorded as a “tag” inserted into the text itself, next to
each word, punctuation mark, etc.
• E.g. [see J&M p.167]
Does that flight serve dinner?
Tagged as follows [using Penn Treebank tagset: front pages of J&M]:
Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.
VBZ = verb, 3rd-person singular present
VB = verb, base form
DT = determiner
NN = mass noun or singular count noun
. = sentence-final punctuation ( . ? !)
• If some phrases (e.g. multi-word names) are treated as units, surrounding tags as in
HTML or XML style are better: <NNP>Royal Bank of Scotland</NNP>
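As a purely illustrative sketch (not from the slides), tagged output like the example above might be held in a program as a list of (lexical form, tag) pairs; the word/TAG rendering is then just a printing convention:

```python
# A minimal sketch: the tagged example sentence as (lexical form, tag) pairs,
# then rendered in the word/TAG style used on the slide.
tagged = [
    ("Does", "VBZ"), ("that", "DT"), ("flight", "NN"),
    ("serve", "VB"), ("dinner", "NN"), ("?", "."),
]

print(" ".join(f"{word}/{tag}" for word, tag in tagged))
# -> Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.
```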
Why It’s Easy to an Extent but then Difficult
• A given form may have several different possible tags in different contexts (e.g.
“bank” can be verb or noun).
• But:
Can get >90% accuracy (compared to human tagging) just by assigning each word
its most common tag (as determined by statistical examination of corpora).
• But:
That’s up to 1 in 10 words (~ 1 word/sentence) wrongly tagged!
Difficulties arise in trying for higher accuracy ...
Why It’s Difficult
... Difficulties arise in trying for higher accuracy ...
– A given utterance may have alternative syntactic analyses, but POS tagging typically
precedes full syntactic analysis.
– An utterance may break normal syntactic rules, giving strange tag sequences.
– POSs may depend on each other over substantial distances in an utterance.
– The needed info about how POSs pattern together may be tricky to gain from corpora:
clever statistical tricks are needed to fill in gaps in what’s known.
– Genre specificity: corpora used need to be either general (i.e., mixed) or ideally from
the genre in question.
Types of Approach
• Rule-Based (including constraint-based)
– uses many hand-written rules saying what POS to choose from the
possibilities for a word, given info on surrounding POSs
– the earliest approach to be developed
• Stochastic [what we’ll mainly look at]
– often implemented via HMMs – Hidden Markov Models
– the hopefully most probable sequence of tags for an utterance is
worked out by standard probabilistic calculations and graph traversal
– usually relies on statistics gathered from manually tagged corpora
• Transformation-Based (including the popular “Brill” tagger)
– tagging is done by rules, in iterative passes: the previous pass’s tagging
can be transformed by rules
– but the rules are learned automatically, from manually tagged corpora
Required Basics of Probability
• Conditional Probability P(X|Y)
is the probability that X is the case given that Y is the case. E.g.:
– P(John has a cold | John is sneezing a lot) = 0.9
P(John is sneezing a lot | John has a cold) = 0.6
– P(a word is “the” | that word is a Determiner) = 0.27 (say)
P(a word is a Determiner | that word is “the”) = 0.99 (say)
• Can estimate P(X|Y) by
COUNT(number of occasions on which X and Y are both the case) / COUNT(number of occasions on which Y is the case)
i.e. the proportion of Y occasions on which X also happens.
• P(X|Y) = P(X and Y) / P(Y)
• P(X and Y) = P(X|Y) P(Y)   [a great thing to remember!]
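A small illustrative sketch of that counting estimate, using the Determiner example from above; the toy list of “occasions” is invented, so the resulting number means nothing beyond the arithmetic:

```python
# Hypothetical sketch: estimate P(X|Y) = COUNT(X and Y) / COUNT(Y), where
# X = "the word is 'the'" and Y = "the word is a Determiner".
occasions = [  # (word, is_determiner) -- invented toy data
    ("the", True), ("a", True), ("the", True), ("dog", False),
    ("the", True), ("this", True), ("ran", False), ("a", True),
]

count_Y = sum(1 for word, is_det in occasions if is_det)
count_X_and_Y = sum(1 for word, is_det in occasions if is_det and word == "the")

p_X_given_Y = count_X_and_Y / count_Y  # proportion of Y occasions on which X also holds
print(p_X_given_Y)                     # 3/6 = 0.5 on this toy data
```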
Basics of Probability, contd
• P(X and Y) = P(X|Y) P(Y)
• But so also: P(X and Y) = P(Y|X) P(X)
So: P(X|Y) P(Y) = P(Y|X) P(X)
So: P(X|Y) = P(Y|X) P(X) / P(Y)   [Bayes’ Rule – an even better thing to remember!]
Bayes’ Rule [should be Bayes’s Rule!] is used all over the place in AI and in statistical NLP.
It is treated almost with religious reverence but is actually a trivial corollary of the notion of conditional probability.
• P(X) is often called a “prior” probability and P(X|Y) a “posterior”
probability, because often Y is some given circumstance that is viewed as
modifying how likely it is that X is the case.
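A worked check of Bayes’ Rule using the sneezing/cold figures from the earlier slide, together with two assumed values chosen only so the numbers come out consistently: P(John has a cold) = 0.3 and P(John is sneezing a lot) = 0.2.

```latex
% P(cold | sneezing) via Bayes' Rule, with the assumed prior and denominator above.
\[
P(\text{cold} \mid \text{sneezing})
  = \frac{P(\text{sneezing} \mid \text{cold}) \, P(\text{cold})}{P(\text{sneezing})}
  = \frac{0.6 \times 0.3}{0.2}
  = 0.9
\]
```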
Basics of Probability, contd
• P(X|Y) and P(Y|X) can have very different values.
E.g., P(DT|an) = P(a word is DT | that word is “an”) could be close to 1,
whereas P(an|DT) could be e.g. 0.03
• But don’t fall into the trap of thinking that they somehow vary in opposite
directions.
For fixed P(X) and P(Y) [actually, for fixed ratio, P(X)/P(Y)] :
the higher that P(Y|X) is, the higher P(X|Y) is.
This fact will be a crucial intuition later.
Basics of Probability, contd
• Don’t make the mistake of thinking that P(X|Y) P(Y) equals P(X).
Rather:
P(X) = P(X|Y) P(Y) + P(X| ¬Y) P( ¬Y)
This is simply because
P(X) = P(X and Y) + P(X and ¬Y).
• In general, if Y has n mutually exclusive sub-possibilities Y1 to Yn,
P(X) = P(X|Y1)P(Y1) + ..... + P(X|Yn)P(Yn)
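Continuing the same assumed numbers, a quick check that P(X|Y) P(Y) on its own is not P(X): taking P(sneezing | ¬cold) ≈ 0.029 (another assumed value),

```latex
% Total probability with the assumed values: P(cold) = 0.3, P(sneezing|cold) = 0.6,
% P(sneezing|~cold) ~= 0.029.
\[
P(\text{sneezing})
  = P(\text{sneezing} \mid \text{cold}) \, P(\text{cold})
    + P(\text{sneezing} \mid \neg\text{cold}) \, P(\neg\text{cold})
  = 0.6 \times 0.3 + 0.029 \times 0.7 \approx 0.2
\]
% whereas P(sneezing|cold) P(cold) alone is only 0.18.
```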
“Ignoring the Bayes Denominator”
• Suppose you want to find which of P(X1|Y) and P(X2|Y) is bigger, but don’t particularly need to know how big it is.
E.g., Y is that Johnny has red spots, X1 is that Johnny has measles and X2 is that Johnny has an allergy. You want to find out Johnny’s more likely condition.
It may be that P(Y|X1) and P(Y|X2) are easier to get information about. It may even be that we know enough about how medical conditions work to estimate those probabilities without gathering statistics, whereas that is less likely for P(X1|Y) and P(X2|Y).
• We know that
P(X1|Y) = P(Y|X1) P(X1) / P(Y) and
P(X2|Y) = P(Y|X2) P(X2) / P(Y).
To find out which is the bigger value, P(Y) “cancels out” – it’s irrelevant.
• So we just need to find which is the bigger of P(Y|X1) P(X1) and P(Y|X2) P(X2).
This isn’t just a mathematical trick – it’s simply that we want to know which of P(Y and X1) and P(Y and X2) is the bigger. And you might have seen that anyway from first principles!
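A hypothetical numeric sketch of the same point; all four probabilities below are invented for illustration, and note that P(Y) never needs to be computed:

```python
# Compare P(Y|X1) P(X1) with P(Y|X2) P(X2); the shared denominator P(Y) cancels.
p_spots_given_measles = 0.95   # P(Y|X1) -- invented
p_spots_given_allergy = 0.60   # P(Y|X2) -- invented
p_measles = 0.01               # P(X1), prior -- invented
p_allergy = 0.10               # P(X2), prior -- invented

score_measles = p_spots_given_measles * p_measles   # = P(Y and X1)
score_allergy = p_spots_given_allergy * p_allergy   # = P(Y and X2)

more_likely = "measles" if score_measles > score_allergy else "allergy"
print(score_measles, score_allergy, more_likely)    # 0.0095 0.06 allergy
```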
“Which is the Biggest”
• Some mathematical notation for “which is the biggest”:
If there are some values V(x) that depend on x (which varies within some specified set of values) then the value of x that makes V(x) biggest can be expressed as argmax_x( V(x) ).
In the example above, we want argmax_x( P(x|Y) ) if it’s understood that x is in the set {X1, X2}. This of course generalizes to more diseases, {X1, X2, ..., Xn}.
► The more/most likely disease given the symptoms.
The Probability Task in POS Tagging
• The task is to find
best-tagging = argmax_tagging( P(tagging | lexform-seq) )
where lexform-seq is some given sequence of lexical forms and tagging is a proposed tagging (tag-sequence) of it, within the set of all possible taggings of that lexical-form sequence.
► The most likely tagging given the lexical-form sequence.
► We start the lexical-form sequence with one or more start-of-sequence markers Ø, and the tag sequence with that many null tags Ø.
• We use Bayes’ Rule to get
best-tagging = argmax_tagging( P(lexform-seq | tagging) * P(tagging) / P(lexform-seq) )
But lexform-seq is fixed, so we can “ignore the denominator,” simplifying to
best-tagging = argmax_tagging( P(lexform-seq | tagging) * P(tagging) )
The Task in POS Tagging, contd.
• So we’ve replaced the calculation of
best-tagging = argmax_tagging( P(tagging | lexform-seq) )
by the calculation of
best-tagging = argmax_tagging( P(lexform-seq | tagging) * P(tagging) )
• Why on earth is this an improvement?? you ask.
• Because ....
We can approximate ...
best-tagging = argmax_tagging( P(lexform-seq | tagging) * P(tagging) )
– We can approximate the prior probability P(tagging) – see following slides.
– It turns out we can approximate P(lexform-seq | tagging) pretty well by pretending that, at each position in the sequence, the probability of the lexical form being there (given the whole tagging) actually depends only on the tag there; i.e. the probability at that position is just P(lexform | tag), which can be estimated from a corpus by count(lexform & tag) / count(tag).
So P(lexform-seq | tagging) is approximated as the product of the individual P(lexform | tag) values
(ignoring P(lexform_0 | tag_0) = P(Ø | Ø) = 1)
i.e.
P(lexform_n | tag_n) * P(lexform_{n-1} | tag_{n-1}) * ..... * P(lexform_1 | tag_1)
i.e.
Π_{i=1..n} P(lexform_i | tag_i)
(Π is a product operator, exactly analogous to the summation operator Σ)
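A minimal sketch of the count(lexform & tag) / count(tag) estimate of P(lexform | tag); the tiny tagged “corpus” below is invented purely for illustration:

```python
from collections import Counter

# Toy corpus of (lexform, tag) pairs -- invented for illustration only.
tagged_corpus = [
    ("the", "DT"), ("flight", "NN"), ("serves", "VBZ"), ("dinner", "NN"),
    ("a", "DT"), ("dinner", "NN"), ("flight", "NN"), ("serves", "VBZ"), ("the", "DT"),
]

tag_counts = Counter(tag for _, tag in tagged_corpus)
pair_counts = Counter(tagged_corpus)

def p_lexform_given_tag(lexform, tag):
    """Estimate P(lexform | tag) as count(lexform & tag) / count(tag)."""
    return pair_counts[(lexform, tag)] / tag_counts[tag]

print(p_lexform_given_tag("dinner", "NN"))   # 2/4 = 0.5 on this toy corpus
print(p_lexform_given_tag("the", "DT"))      # 2/3 on this toy corpus
```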
Towards P(tagging):
Iterative Conditionalization
• Suppose there is a set Z of possibilities, and we list them in some order (any order we like) as ZØ to Z_m. Then
P(ZØ,...,Z_m), i.e. P(ZØ and .... and Z_m), = P(Z_m | Z_{m-1}, ... , ZØ) * P(Z_{m-1}, ... , ZØ).
But we can do the same thing on P(Z_{m-1}, ... , ZØ), and ultimately we get:
P(ZØ,...,Z_m) = P(Z_m | Z_{m-1}, ... , ZØ) * P(Z_{m-1} | Z_{m-2}, ... , ZØ) * ... * P(ZØ).
Iterative Conditionalization for Tags
• One application of the previous slide: consider sentences of length n plus the imaginary start-of-sequence marker Ø. For convenience we’ll include under “sentences” other meaningful lexical-form sequences that might be used in place of sentences.
• Suppose tagging = TØ,...,T_n is a list of tags, with TØ being Ø.
• We ask: what is P(TØ,...,T_n), i.e. the probability of an arbitrary sentence of length n, plus Ø at the start, having the POSs (etc.) TØ,...,T_n?
– NB: this is not a question about any specific sentences.
• Answer (where T_k now means that word k, whatever it is, has tag T_k):
P(TØ,...,T_n) = P(T_n | T_{n-1}, ... , TØ) * P(T_{n-1} | T_{n-2}, ... , TØ) * ... * P(T_1 | TØ) * P(TØ).
• But P(TØ) = 1, so
P(TØ,...,T_n) = P(T_n | T_{n-1}, ... , TØ) * P(T_{n-1} | T_{n-2}, ... , TØ) * ... * P(T_1 | TØ)
N-gram Approximations
• An n-gram is a sequence of n lexical forms (for example).
NLP is much concerned with the statistics of n-grams, e.g. what’s the frequency of occurrence of the 2-gram (bigram) “dog bit”?
Bigrams (and trigrams) are of special interest in the POS tagging area.
• The idea extends to sequences of tags rather than lexical forms.
• We can apply approximations to conditional probabilities concerning the dependence of tags on each other.
E.g. we can adopt the bigram approximation that
P(T_k | T_{k-1}, ... , TØ) = P(T_k | T_{k-1})
i.e. the probability of a tag occurring is dependent only on the immediately preceding tag, not earlier ones.
• The trigram approximation is that
P(T_k | T_{k-1}, T_{k-2}, ... , TØ) = P(T_k | T_{k-1}, T_{k-2})
(We introduce two start-of-sentence markers to make sure everything works right.)
N-gram Approximations, contd
• Then under the bigram approximation we get
P(TØ,...,T_n) = P(T_n | T_{n-1}) * P(T_{n-1} | T_{n-2}) * ... * P(T_1 | TØ)
= Π_{i=1..n} P(T_i | T_{i-1})
• So if you had such bigram probabilities about tags then you could approximate the probability of a suggested tagging.
• NB: This is a “prior” probability, not dependent on any specific words being tagged. It’s just about the patterns of possible tag sequences, across the whole language.
• Can estimate P(T_i | T_{i-1}) = P(tag at i is a tag Y | tag at i-1 is a tag X) from a corpus as
count(tag X occurs & next lex-form has tag Y) / count(tag X occurs).
NB: This estimation ignores the numerical position i.
• Similar observations apply to trigrams. (Exercise: work out the formula that replaces the one at the top of this slide.)
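A minimal sketch of that bigram estimate, over a couple of invented tag sequences (each starting with the marker Ø); as noted above, the numerical position i plays no role in the counts:

```python
from collections import Counter

# Invented toy tag sequences, one per "sentence", each beginning with Ø.
tag_sequences = [
    ["Ø", "DT", "NN", "VBZ", "NN", "."],
    ["Ø", "VBZ", "DT", "NN", "VB", "NN", "."],
]

prev_counts = Counter()      # count(tag X occurs and has a successor)
bigram_counts = Counter()    # count(tag X occurs & next lex-form has tag Y)
for seq in tag_sequences:
    for prev_tag, tag in zip(seq, seq[1:]):
        prev_counts[prev_tag] += 1
        bigram_counts[(prev_tag, tag)] += 1

def p_tag_given_prev(tag, prev_tag):
    """Estimate P(tag at i is Y | tag at i-1 is X) from the counts."""
    return bigram_counts[(prev_tag, tag)] / prev_counts[prev_tag]

print(p_tag_given_prev("NN", "DT"))   # 2/2 = 1.0 on this toy data
print(p_tag_given_prev("DT", "Ø"))    # 1/2 = 0.5 on this toy data
```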
Putting It Together
• Recall best-tagging = argmax_tagging( P(lexform-seq | tagging) * P(tagging) )
Then, using our approximations,
• best-tagging = argmax_{T_1,...,T_n}( Π_{i=1..n} P(lexform_i | T_i) * Π_{i=1..n} P(T_i | T_{i-1}) )
i.e.,
best-tagging = argmax_{T_1,...,T_n}( Π_{i=1..n} [ P(lexform_i | T_i) * P(T_i | T_{i-1}) ] )
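The combined formula can be evaluated directly by brute force: enumerate every possible tagging, score each one, and keep the argmax. The sketch below does exactly that on invented toy probability tables; it is only to make the formula concrete, since enumerating all T^W candidate taggings is exactly what the algorithm on the following slides avoids.

```python
from itertools import product

# Invented toy tables: emission P(lexform | tag) and transition P(tag | previous tag),
# with "Ø" as the start-of-sequence marker. Missing entries are treated as 0.
emission = {
    ("that", "DT"): 0.3, ("that", "NN"): 0.001,
    ("flight", "NN"): 0.02,
}
transition = {
    ("Ø", "DT"): 0.4, ("Ø", "NN"): 0.2,
    ("DT", "NN"): 0.5, ("DT", "DT"): 0.01,
    ("NN", "NN"): 0.1, ("NN", "DT"): 0.05,
}
tagset = ["DT", "NN"]
sentence = ["that", "flight"]

def score(tags):
    """Product over i of P(lexform_i | T_i) * P(T_i | T_{i-1})."""
    p, prev = 1.0, "Ø"
    for word, tag in zip(sentence, tags):
        p *= emission.get((word, tag), 0.0) * transition.get((prev, tag), 0.0)
        prev = tag
    return p

best = max(product(tagset, repeat=len(sentence)), key=score)
print(best, score(best))   # ('DT', 'NN') 0.0012 on these toy tables
```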
Towards an Algorithm
best-tagging = argmax_{T_1,...,T_n}( Π_{i=1..n} [ P(lexform_i | T_i) * P(T_i | T_{i-1}) ] )
• Notice a sort of “incremental maximality” property:
Let best-tagging be B_1,...,B_n. Then, for any k, B_1,...,B_k must be the best (i.e., highest Π_{i=1..k} value) sub-tagging that ends with B_k (though not necessarily the best that ends at position k), because otherwise you could replace that sub-tagging with a better one and thus improve best-tagging itself.
You could do that replacement because the tags within it don’t affect the above P values after position k.
• The above is key to an incremental algorithm that steps through the sentence finding best sub-taggings at every stage: one for each tag that is possible for the lexform at position k.
For each of those tags, record the best sub-tagging ending with it, using the above formula with n = k, and record its probability.
To find the best sub-taggings at the next stage, we just extend the best one for each tag at the current stage: there is no point extending non-best ones.
Each extension step multiplies the probability by P(lexform_{k+1} | T_{k+1}) * P(T_{k+1} | T_k).
[Diagram: a trellis illustrating the algorithm on a sentence of four lexical forms L1–L4 (position 4 = n), preceded by the start marker Ø. Each position has its candidate tags (e.g. T1:i ... T1:v at position 1, T4:i ... T4:iii at position 4). Arcs between adjacent positions carry transition probabilities such as P(T2:i|T1:i) and P(T4:iii|T3:iv), and each tag node carries an emission probability such as P(L3|T3:ii). The best sub-tagging up to position 3 and the overall best-tagging are highlighted.]
Outline of the Viterbi Algorithm
• At position k in the lexical-form sequence:
– ASSUMPTION: For each tag t in the tag-set, you’ve calculated the best tag
subsequence that ends with t at position k
• and you’ve calculated a probability value for it:
P(lexform-seq up to k | that subseq up to tag t at k) * P(that subseq up to tag t at k).
So altogether we have T best subsequences up to and including position k, where T is the size of the tag-set.
– For each tag u in the tag-set, EXTEND each such best subsequence to go to position
k+1 by adding u to the subsequence. So we now have T2 subsequences …
– … and in fact for each tag u at position k+1 we have T subsequences ending at u
(each is the extension of a best subsequence to position k).
• CALCULATE the probability value for each of these subsequences up to u at k+1 as the value for the subsequence up to t at k that it was formed from, times P(lexform_{k+1} | u), times P(u|t).
– For each tag u at position k+1, FIND THE BEST of those subsequences up to it, i.e.
the one with the highest probability value. You now loop to the top of this page,
with k+1 replacing k.
Outline contd.
• STARTING the algorithm, at position 0, just before the first lexical form:
Consider there to be the bogus tag TØ there,
– and one best subsequence, consisting just of this tag,
– where this subsequence has probability value 1.
• TERMINATING the algorithm: one way is:
At position n, at the end of the lexical-form sequence, take the answer to be
the best sequence up to this position (across all the tags).
• Another way:
Have a bogus tag TØ at position n+1, and go up to this position. The normal
execution of the general step at this position gives the best sequence up to it,
so we have an answer. In the probability calculation, we take the lexform at n+1
to be a bogus word that is necessarily there, so
– we take P(lexform_{n+1} | TØ) to be 1 and P(TØ|t) = P(t occurs at the end of a sentence).
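A minimal Python sketch of the algorithm outlined above, terminating the first way (taking the best sequence over all tags at the last position). The probability tables and sentence are invented toy values; in practice they would be estimated from a tagged corpus as described earlier.

```python
def viterbi(sentence, tagset, emission, transition, start="Ø"):
    # best[t] = (probability value, best tag subsequence ending with t at this position)
    best = {start: (1.0, [])}                     # the bogus start tag, value 1
    for word in sentence:
        new_best = {}
        for u in tagset:                          # candidate tag u at position k+1
            # extend the best subsequence ending with each tag t at position k by u,
            # multiplying in P(lexform_{k+1} | u) * P(u | t) ...
            candidates = [
                (p * transition.get((t, u), 0.0) * emission.get((word, u), 0.0), seq + [u])
                for t, (p, seq) in best.items()
            ]
            # ... and keep only the best of the subsequences ending with u here
            new_best[u] = max(candidates, key=lambda c: c[0])
        best = new_best
    # terminate: take the best sequence over all tags at the final position
    return max(best.values(), key=lambda c: c[0])

# Invented toy tables and sentence, for illustration only.
emission = {("that", "DT"): 0.3, ("that", "NN"): 0.001,
            ("flight", "NN"): 0.02, ("serves", "VBZ"): 0.05}
transition = {("Ø", "DT"): 0.4, ("Ø", "NN"): 0.2, ("DT", "NN"): 0.5,
              ("NN", "VBZ"): 0.3, ("NN", "NN"): 0.1, ("VBZ", "NN"): 0.4}

value, tags = viterbi(["that", "flight", "serves"], ["DT", "NN", "VBZ"], emission, transition)
print(tags, value)   # ['DT', 'NN', 'VBZ'] 1.8e-05 on these toy tables
```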
The Algorithm as Special Breadth-First Search
[OPTIONAL]
• Can be seen as a breadth-first search with special features:
– Each (non-zero-valued) node is connected to all nodes at next level.
– Each node on the last level is a goal node, and they are all known in advance.
– We have to find the best-valued path.
– The value of the path is not found by adding arc costs but by multiplying arc
values and also multiplying in some node values.
• But if you absorbed the node values into modified arc values and took negated logarithms
you could use additions to get a traditional cost.
• The need to keep a record of the best-valued path to any node,
and its value, is the same as the need in A* to keep a record of the
least-cost path to any node together with its cost.
Its Computational Complexity
• O(W·T²) computation steps, where W is the number of
positions (= length of the lexical-form sequence) and T is the
max number of tags that need to be considered at a position in
the given sequence.
– T needs to count all tags at each position, unless we can definitely
exclude some in some positions.
• If we didn’t have the “incremental maximality” property above, we might have had to compute all possible tag sequences, of which there are up to T^W.
“Viterbi” Algorithm
• The algorithm is an example of a so-called “Viterbi” algorithm.
• Such algorithms are used in various sectors of NLP.
• It’s actually just an application of a widely known technique called
Dynamic Programming (see J&M).
Hidden Markov Model
• The above network is an example of a Hidden Markov Model.
• A Markov Model is like a Finite State Network but with probabilities
attached to the arcs (transitions). In our case:
– The states are tags-at-positions.
– The transition probabilities are the P(T_i | T_{i-1}) values.
– We don’t really need to label the arcs, but to relate more closely to FSNs, we
can regard the arc label as being the relevant tag pair.
• A Hidden Markov Model is a Markov Model plus the idea that you
can’t observe the states themselves but only derived information (the
lexical-forms in our case), where that observable information is
related by a conditional probability (= “observation likelihood” or “emission
probability”) to the state. In our case, for each state:
– the lexform at the state’s position in the sequence, and the P(lexform | tag) value.
POS Tagging as HMM by Viterbi
• So our POS tagging is being done by converting the task into a standard thing to do
with an HMM (find the best hidden path given the observations – the “decoding”
task); and we do it by means of a standard (“Viterbi”) algorithm.
Refinements
• Can also include an end-of-sequence marker/tag pair Ø/Ø, to take
care of constraints on how a sentence can end as well as on how it
can start.
• Can make the P(lexform-seq| tagging) approximation cleverer,
e.g. by replacing the lexical-form likelihoods—the P(lexform| tag)
values—by P(lexform| tag-before, tag, tag-after), and can even
include neighbouring lexical forms in the condition.
• Contd ...
Refinements
• If the hand-tagged corpus doesn’t provide a particular lexform
for a particular tag, then P(lexform| tag) is estimated as zero,
but it could be wise to assign it a tiny value just in case, unless
you know it definitely cannot happen (NB: but can you ever
know this, given the possibility of errors in the corpus??).
• Similarly, if the hand-tagged corpus doesn’t contain a particular
tag bigram or trigram, this may be because it’s impossible, but
it’s safer to assume it’s possible but rare. So ....
Coping with missing tag n-grams
• If the bigram T_i, T_{i-1} is missing, can give a tiny value (e.g., 0.0001) to P(T_i | T_{i-1}). Similarly for trigrams.
• More sophisticated: use the probability for the next shorter n-gram (even the 1-gram, i.e. the single tag). E.g.,
if T_i, T_{i-1} is missing, use P(T_i) instead of P(T_i | T_{i-1});
if T_i, T_{i-1}, T_{i-2} is missing, use P(T_i | T_{i-1}) instead of P(T_i | T_{i-1}, T_{i-2}).
• Yet more sophisticated: combine such estimators in a weighted formula, e.g. for trigrams use
λ P(T_i | T_{i-1}, T_{i-2}) + μ P(T_i | T_{i-1}) + ν P(T_i)
See J&M on how to choose the coefficients λ, μ, and ν.
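A hypothetical sketch of that weighted combination; the λ, μ, ν values below are arbitrary placeholders, and the three probability arguments are assumed to have been estimated from a corpus already (see J&M for how the coefficients are actually chosen):

```python
def interpolated_trigram(p_tri, p_bi, p_uni, lam=0.6, mu=0.3, nu=0.1):
    """Return lam*P(T_i | T_{i-1}, T_{i-2}) + mu*P(T_i | T_{i-1}) + nu*P(T_i).

    The weights should sum to 1 so the result is still a probability.
    """
    assert abs(lam + mu + nu - 1.0) < 1e-9
    return lam * p_tri + mu * p_bi + nu * p_uni

# Even if the trigram was never seen (estimate 0.0), the result is non-zero:
print(interpolated_trigram(0.0, 0.05, 0.02))   # 0.6*0 + 0.3*0.05 + 0.1*0.02 = 0.017
```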