Transcript Document

Word Prediction
• Words do not appear randomly in text. The probability of a word appearing in a text depends to a large degree on the words that have appeared before it.
– e.g. I’d like to make a collect ...
– Call is the most likely next word, but other words such
as telephone, international... are also possible. Other
(very common) words are unlikely (e.g. dog, house).
Word Prediction
• Word prediction is very useful for applications such as:
– Speech recognition: It is possible to select between
words that are hard for a speech recognizer to
distinguish.
– Augmentative communication for the disabled: Speech
generation systems can become more effective.
– Spelling error detection:
• They are leaving in about 15 minuets.
• He is trying to fine out.
• Word prediction is also related to the problem of
computing the probability of a sentence.
Counting Words in Corpora
• Text corpus: A collection of text (or speech).
– Brown corpus: 1M words of text.
– Switchboard corpus: 3 million words of speech.
• Word counting in corpora:
– Punctuation may count as words or not.
– Is “don’t”, “O’Reilly” or “non-carbonated” one word or two?
– Are “They” and “they” the same word or different words?
• Many of these choices depend on the application.
Words in Corpora
• Tokens: Total number of running words in a corpus (possibly including punctuation).
• Types: Number of distinct words in corpus.
• Wordform: The inflected word as it appears in the corpus.
e.g. cat ≠ cats.
• Lemma: Set of word forms having the same stem and the
same word sense. e.g. cat and cats are not distinguished.
– Switchboard corpus: 2.4M tokens, 20,000 types
– Brown corpus: 1M tokens, 61,805 types, 37,851
lemmata
Simple N-Grams
• Probabilistic models of word sequence
• Simplest model: Every word may follow any other word.
All words have equal probability.
• More complex: The probability of appearance of each
word depends on its frequency in the corpus. e.g.
– the appears 69,971 times in Brown corpus (7%)
– rabbit appears 12 times (0.001%)
• But suppose we have the sentence:
– Just then the white...
Markov Assumption
• The probability of the appearance of a word depends on the words that
have appeared before it.
P(rabbit | Just then the white)
• It is practically impossible to estimate this probability from a corpus: the exact word sequence would have to appear in the corpus.
• Markov simplifying assumption: we approximate the probability of a
word given all the previous words with the probability given only the
previous word.
P(rabbit | Just then the white) ≈ P(rabbit | white)
• First used by Markov (1913) to calculate if the upcoming letter in
Pushkin’s Eugene Onegin would be a vowel or a consonant
Bigrams, Trigrams ... N-grams
• When we look only at the previous word we call the model a bigram model (or first-order Markov model).
• When looking two words back we have a trigram model
(or second order Markov model).
• Generally, looking N-1 words back, we get an N-gram model (or (N-1)th-order Markov model).
• The probability of a sentence is calculated by multiplying
the (approximated) probability of each word in the
sentence.
• e.g: P(I want to eat Chinese food)=P(I | <s>)P(want | I)P(to
| want)P(eat | to)P(Chinese | eat)P(food | Chinese)
N-gram Probability Calculation
• We count the appearances of a bigram in our training
corpus and then we normalize by dividing by the number
of appearances of the first word.
P(Wn | Wn-1)=C(Wn-1 Wn) / C(Wn-1)
• A problem with N-grams is that many valid N-grams will be missing from the finite training corpus. Many bigrams that should receive some probability get zero or a very small probability.
Example unigram counts from a training corpus:
I: 3437, want: 1215, to: 3256, eat: 938, Chinese: 213, food: 1506, lunch: 459
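A minimal Python sketch of maximum-likelihood bigram estimation and of the sentence probability computed as a product of bigram probabilities; the toy corpus below is purely illustrative, not the corpus the counts above come from.

from collections import Counter

def train_bigram_mle(sentences):
    """Count unigrams and bigrams (with <s>/</s> markers) and return a
    function giving the MLE estimate P(w2|w1) = C(w1 w2) / C(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    def prob(w2, w1):
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return prob

def sentence_probability(sentence, prob):
    """Multiply the (approximated) probability of each word given its predecessor."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    p = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        p *= prob(w2, w1)
    return p

# Toy corpus, for illustration only
prob = train_bigram_mle(["I want to eat Chinese food", "I want to eat lunch"])
print(sentence_probability("I want to eat Chinese food", prob))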
Smoothing
• Smoothing (or discounting) is the process of assigning some probability to zero- or low-probability N-grams.
• Add-one smoothing adds one appearance to every possible N-gram. This method does not work well because it takes too much probability away from likely N-grams and shares it among many impossible N-grams, e.g. P(to | want) changes from 0.65 to 0.28.
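A short sketch of the add-one (Laplace) estimate for bigrams, assuming count dictionaries like those in the previous sketch; vocab_size is the number of word types V.

def add_one_prob(w2, w1, bigrams, unigrams, vocab_size):
    """Laplace (add-one) estimate: (C(w1 w2) + 1) / (C(w1) + V)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)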
Smoothing
• Witten-Bell discounting: we use the count of N-grams we have seen only once (first-time events) to help estimate the count of N-grams we have never seen. The probability of seeing a new N-gram for the first time is estimated from the number of distinct N-gram types that have appeared at least once in the corpus. This moves far less probability from likely to unlikely events and gives better models than add-one smoothing.
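A simplified sketch of a Witten-Bell estimate for bigrams, under the usual formulation: for a history w1, the mass T/(N+T) is reserved for unseen continuations (T = number of continuation types seen after w1, N = tokens following w1) and shared among the Z types never seen after w1. The counts and vocabulary are assumed to come from sketches like the ones above.

def witten_bell_prob(w2, w1, bigrams, unigrams, vocab):
    """Simplified Witten-Bell estimate for a bigram model.
    N = tokens observed after w1, T = distinct continuation types of w1,
    Z = word types never seen after w1. Seen bigrams get c / (N + T);
    the reserved mass T / (N + T) is shared among the Z unseen types."""
    n = unigrams[w1]
    followers = {b for (a, b) in bigrams if a == w1}
    t = len(followers)
    z = len(vocab) - t
    c = bigrams[(w1, w2)]
    if c > 0:
        return c / (n + t)
    return t / (z * (n + t)) if z else 0.0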
Training and Test sets
• The statistical parameters of a model are trained on a training corpus
and then the quality of the model is tested on a test set.
• Selection of training corpus is important. If it is too specific it might
not generalize well to new sentences. If it is too general it may not be
reflective of the domain of the application where we want to use it.
• Selection of the test sentences is also critical. If the test sentence comes
from the training corpus then the performance will be artificially high.
If the domain of the test corpus is different than that of the training
corpus then the statistical model will not reflect the properties of the
test set.
• Usually we take a single corpus and divide it into a training set and a test set.
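A small illustrative sketch of dividing a single corpus into training and test sets; the 10% held-out fraction is an arbitrary choice, not one prescribed above.

import random

def split_corpus(sentences, test_fraction=0.1, seed=0):
    """Shuffle the sentence list and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]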
N-grams and Training Corpora
• Example of artificially generated sentences showing the performance
of N-grams for various N and the dependence of the model on the
domain training corpus.
• Works of Shakespeare
– Unigram: Will rash been and by I the me loves gentle me not
slavish page, the and hour, ill let
– Bigram: Why dost stand forth thy canopy, forsooth; he is this
palpable hit the King Henry. Live king. Follow.
– Trigram: Sweet Prince, Falstaff shall die. Harry of Monmouth’s
grave.
– Quadrigram: Enter Leonato’s brother Antonio, and the rest, but
seek the weary beds of people sick.
N-grams and Training Corpora
• Similar sentences generated from the Wall Street Journal corpus.
– Unigram: Months the my and issue of wear foreign new
exchange’s September were recession exchange new
endorsed a acquire to six executives.
– Bigram: Last December through the way to preserve
the Hudson corporation N. B. E. C. Taylor would seem
to complete the major central planners ...
– Trigram: They also point to ninety nine point six billion dollars from two hundred oh six three per cent of the rates of interest ...
Context-Sensitive Spelling Error
Correction
• Spelling errors which result in real words cannot be found
without looking at the context of a word.
– According to empirical studies, between 15% and 40% of spelling errors result in real words.
• Local errors are errors that can be detected by looking at the immediate surroundings of a word. Global errors require looking at a larger context.
Local and Global Errors
Context-Sensitive Spelling Error
Correction
• Generate every possible misspelling of a word that results
in a real word. These are generated based on typographical
transformations (letter insertion, deletion, substitution) or
based on lists of homophones (e.g. piece – peace)
• Then given all these possible words we select the word that
maximizes the N-gram probability of the sentence that
contains the word in question.
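A rough sketch of this selection step, assuming a hand-made (purely illustrative) table of confusion sets and a bigram probability function prob like the one sketched earlier; a real system would generate the candidate words automatically from typographical transformations and homophone lists.

# Illustrative confusion sets (homophones / real-word misspellings)
CONFUSION_SETS = {
    "peace": ["piece"], "piece": ["peace"],
    "fine": ["find"], "minuets": ["minutes"],
}

def correct_sentence(tokens, prob):
    """For each word, try its confusion-set alternatives and keep the variant
    that maximizes the bigram probability of the whole sentence."""
    def score(toks):
        padded = ["<s>"] + toks + ["</s>"]
        p = 1.0
        for w1, w2 in zip(padded, padded[1:]):
            p *= prob(w2, w1)
        return p
    best = list(tokens)
    for i, word in enumerate(tokens):
        for alt in CONFUSION_SETS.get(word, []):
            candidate = best[:i] + [alt] + best[i + 1:]
            if score(candidate) > score(best):
                best = candidate
    return best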
Part of Speech Tagging (POST)
• POST is the process of assigning to each word in a text its part of speech or, more generally, its word class.
• Input: A text and a set of possible word classes (tagset).
• Output: The text, where next to each input word its word class has
been marked.
A good POS tagger is often the first step for many applications such as:
• Speech Generation and Recognition
• Machine Translation
• Information Retrieval
• Computational Lexicography
• Word sense disambiguation
• Prepositional phrase attachment disambiguation
Word Classes and Tagsets
• For English there are a number of different tagsets, differing in the number of word classes each contains.
– Brown Corpus tagset: 87
– Penn Treebank tagset: 45
– C5 (British National Corpus): 67
– C7: 146
• For Greek different tagsets have been used ranging from 58 to 584
word classes.
Example of Sentence with POS annotation:
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN
of/IN other/JJ topics/NNS ./.
Penn Treebank Tagset
WC    Description                                 Example
CC    Coordinating conjunction                    and, but, or
CD    Cardinal number                             one, two, three
DT    Determiner                                  a, the
EX    Existential "there"                         there
FW    Foreign word                                mea culpa
IN    Preposition or subordinating conjunction    of, in, by
JJ    Adjective                                   yellow
JJR   Adjective, comparative                      bigger
JJS   Adjective, superlative                      wildest
LS    List item marker                            1, 2, One
MD    Modal                                       can, should
NN    Noun, singular or mass                      llama
NNS   Noun, plural                                llamas
NNP   Proper noun, singular                       IBM
NNPS  Proper noun, plural                         Carolinas
PDT   Predeterminer                               all, both
POS   Possessive ending                           's
PRP   Personal pronoun                            I, you, he
PRP$  Possessive pronoun                          your, one's
RB    Adverb                                      quickly, never
RBR   Adverb, comparative                         faster
RBS   Adverb, superlative                         fastest
RP    Particle                                    up, off
SYM   Symbol                                      +, %, &
TO    "to"                                        to
UH    Interjection                                ah, oops
VB    Verb, base form                             eat
VBD   Verb, past tense                            ate
VBG   Verb, gerund or present participle          eating
VBN   Verb, past participle                       eaten
VBP   Verb, non-3rd person singular present       eat
VBZ   Verb, 3rd person singular present           eats
WDT   Wh-determiner                               which, that
WP    Wh-pronoun                                  what, who
WP$   Possessive wh-pronoun                       whose
WRB   Wh-adverb                                   how, where
$     Dollar sign                                 $
#     Pound sign                                  #
“     Left quote                                  ‘ “
”     Right quote                                 ’ ”
(     Left parenthesis                            { ( [ <
)     Right parenthesis                           } ) ] >
,     Comma                                       ,
.     Sentence-final punctuation                  . ! ?
:     Mid-sentence punctuation                    : ; … -- -
Ambiguity in POST
• Ambiguity is a serious problem for POST, i.e. a word may belong to more than one word class. E.g.
Book/VB that/DT flight/NN.
Book: VB (verb) or NN (noun)
That: DT (determiner) or IN (subordinating conjunction)
• Many of the most commonly occurring words are ambiguous: only 11.5% of the lemmata in the Brown Corpus are ambiguous, but 40% of the word tokens are ambiguous.
Word Classes   Number of Words
1 WC           35,340
2 WC           3,760
3 WC           264
4 WC           61
5 WC           12
6 WC           2
7 WC           1 (still)
POST Algorithms
• Rule-Based: They use a large pool of hand-written rules for word class disambiguation.
• Stochastic and Probabilistic: They use a training corpus to
calculate the probability that a certain word belongs to a
certain word class, depending on the context of the word in
the text.
• Transformation-based Learning (Brill algorithm):
Combination of rule-based and stochastic approaches.
Rule-Based POST
• Usually two stages:
– Use a dictionary or morphological analysis to assign to each word all its possible POS tags.
– Use a list of hand-written disambiguation rules to select only one POS tag for each word.
• ENGTWOL tagger:
– First stage: uses a two-level morphological analysis to find the possible word classes for each word, plus some additional syntactic information (e.g. verb subcategorisation).
– Second stage: uses about 1,100 constraints to rule out incorrect POS tags.
ENGTWOL Rules
• Rules are used to eliminate tags inconsistent with the context. E.g.:
ADVERBIAL-THAT RULE
Given input: “that”
if
(+1 A/ADV/QUANT); // if the next word is an adjective, adverb or quantifier
(+2 SENT-LIM);    // and the word after that is a sentence boundary
(NOT -1 SVOC/A);  // and the previous word is not a verb that allows adjectives
                  // as complements (e.g. consider)
then eliminate non-ADV tags
else eliminate ADV tag
• This rule checks whether the word “that” has an adverbial reading, e.g. This isn’t that odd.
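A rough Python rendering of this constraint, assuming each word carries a set of candidate readings and that the tag names (A, ADV, QUANT, SENT-LIM, SVOC/A) are represented as shown in the rule above; this only approximates the ENGTWOL formalism.

def adverbial_that_rule(words, tags, i):
    """Apply the ADVERBIAL-THAT constraint at position i, where words[i] is
    "that" and tags is a list of sets of candidate readings per word."""
    next_is_adj = i + 1 < len(words) and tags[i + 1] & {"A", "ADV", "QUANT"}
    then_boundary = i + 2 >= len(words) or tags[i + 2] & {"SENT-LIM"}
    prev_not_svoc = i == 0 or not (tags[i - 1] & {"SVOC/A"})
    if next_is_adj and then_boundary and prev_not_svoc:
        tags[i] &= {"ADV"}      # eliminate non-ADV tags
    else:
        tags[i] -= {"ADV"}      # eliminate the ADV tag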
Stochastic Part-of-Speech Tagging
• The intuition is to pick the most-likely tag for each word.
• We select the tag that maximizes the probability:
P(word|tag)*P(tag|previous n tags)
• Markov simplifying assumption:
P(tag|previous n tags)=P(tag|previous tag)
so we want to maximize P(word|tag)* P(tag|previous tag)
• The probabilities P(word|tag) and P(tag1|tag2) are estimated by counting occurrences of words and tags in annotated training corpora.
Stochastic POST Example
• Find the POS of the word “race” in the sentence:
Secretariat is expected to race tomorrow.
• Is race a noun (NN) or a verb (VB)? Assuming we know that the correct tag for to is TO, we have to compare:
P(VB|TO)P(race|VB) and P(NN|TO)P(race|NN)
• From frequency counts of the Brown and the Switchboard corpus we
have:
P(NN|TO) = .021, P(VB|TO) = .34,
P(race|NN)=.00041, P(race|VB)=.00003
So, P(VB|TO)P(race|VB)=.00001 > P(NN|TO)P(race|NN)=.000007
• To tag an entire sentence instead of just one word, we use a Hidden
Markov Model
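The same comparison, written out as a small calculation with the probabilities quoted above:

# Probabilities quoted above (Brown + Switchboard counts)
p_nn_to, p_vb_to = 0.021, 0.34
p_race_nn, p_race_vb = 0.00041, 0.00003

verb_score = p_vb_to * p_race_vb    # ~ 0.0000102
noun_score = p_nn_to * p_race_nn    # ~ 0.0000086
print("VB" if verb_score > noun_score else "NN")   # prints VB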
Transformation-Based POST
• Like rule-based systems, it uses rules to decide which word class will be selected for each word. However, like stochastic systems, it learns these rules from a pre-annotated training corpus.
• The algorithm keeps an ordered list of transformations, which are applied to the text we want to tag. Each transformation is composed of a rewrite rule and an activation condition, e.g.:
– Rewrite rule: change the tag from MD (modal) to NN (noun).
– Activation condition: the previous word is tagged DT (determiner).
The/DT can/MD rusted/VB ./.  →  The/DT can/NN rusted/VB ./.
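A minimal sketch of applying such a transformation to a tagged sentence; the function name and representation (a list of word/tag pairs) are assumptions made for illustration.

def apply_transformation(tagged, from_tag, to_tag, prev_tag):
    """Rewrite rule: change from_tag to to_tag when the previous word
    is tagged prev_tag (e.g. MD -> NN after DT)."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sentence = [("The", "DT"), ("can", "MD"), ("rusted", "VB"), (".", ".")]
print(apply_transformation(sentence, "MD", "NN", "DT"))
# [('The', 'DT'), ('can', 'NN'), ('rusted', 'VB'), ('.', '.')]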
Learning of Transformation Rules
• Initially the training text is given as input without any tagging, and an initial tagging is performed (e.g. assigning each word its most likely tag).
• In each training stage a number of possible transformations (based on certain patterns) is applied to the text. Of these transformations we select the one that produces the annotation closest to the correct annotation of the training text.
• The training continues iteratively until no transformation
can be found that causes any improvement in the tagging
of the text.
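A greedy sketch of this training loop; candidate_rules and apply_rule are assumed interfaces (for example, instantiations of the transformation templates listed below and a rewrite function like the one sketched above).

def learn_transformations(initial_tags, gold_tags, candidate_rules, apply_rule):
    """Greedily pick, at each stage, the rule that most reduces the number of
    tagging errors against the gold annotation; stop when no rule helps."""
    def errors(tags):
        return sum(1 for t, g in zip(tags, gold_tags) if t != g)
    current, learned = list(initial_tags), []
    while True:
        best_rule, best_tags, best_err = None, None, errors(current)
        for rule in candidate_rules:
            new_tags = apply_rule(current, rule)
            if errors(new_tags) < best_err:
                best_rule, best_tags, best_err = rule, new_tags, errors(new_tags)
        if best_rule is None:
            return learned
        learned.append(best_rule)
        current = best_tags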
Allowed Transformations
• The transformations may be lexicalized or not.
• Non-lexicalized: change the tagging of a word from A to B if:
– The previous word is tagged as Z
– One of the two following words is tagged as Z
– The previous word is tagged as Z and the next as W
• Lexicalized: change the tagging of a word from A to B if:
– The previous word is w
– One of the two following words is w
– The previous word is w, the next word is x, and the previous word is tagged as Z
Accuracy of Tagging
• Factors affecting accuracy:
– Amount of training data available.
– Tag set.
– Difference between the training corpus and dictionary and the
corpus of application.
– Number of unknown words (coverage of the dictionary).
• A ‘dumb’ tagger that always assigns the most common part of speech can reach an accuracy of about 90% (for English). Most taggers report accuracies between 95% and 97%.
• These numbers can be somewhat misleading: 97% per-word accuracy means only about a 63% chance (0.97^15 ≈ 0.63) of getting every word in a 15-word sentence correct. When POS tagging is just preprocessing for other applications, very high accuracy is required.
Collocations
• A collocation is an expression of two or more words that correspond to
a conventional way of saying things. e.g. strong tea, weapons of mass
destruction, make up …
• Various definitions. e.g. (Choueka, 1988)
– A collocation is defined as a sequence of two or more consecutive
words, that has characteristics of a syntactic and semantic unit and
whose exact and unambiguous meaning or connotation cannot be
derived directly from the meaning or connotation of its
components.
• Adjacency of words is not always necessary (e.g. knock...door).
Applications of Collocations
• Language Generation: To make the generated language
sound natural.
• Computational Lexicography: To identify important
collocations for listing in dictionaries.
• Parsing: For preferring parses containing natural
collocations.
• Corpus Linguistic Research: For studying social phenomena as they are expressed in language.
Criteria for Collocations
• Non-compositionality: The meaning of a collocation is not
a straightforward composition of the meanings of its parts.
e.g. white in white wine, white hair and white woman
refers to somewhat different colors depending on the
collocation.
• Non-substitutability: We cannot substitute other words for
the components of a collocation. E.g. we cannot say yellow
wine instead of white wine.
• Non-modifiability: Most collocations cannot be freely
modified with additional lexical material. E.g. we cannot
say kicked the wooden bucket.
Discovering Collocations
• We can search for collocations based on their frequency in
a corpus
Frequency   Bigram
80,871      of the
58,841      in the
26,430      to the
21,842      on the
21,839      for the
18,568      and the
16,121      that the
15,630      at the
15,494      to be
13,899      in a
13,689      of a
13,361      by the
13,183      with the
12,622      from the
11,428      New York
10,007      he said
9,775       as a
9,231       is a
8,753       has been
8,573       for a

Most common bigrams from the New York Times corpus (14M words)
Discovering Collocations
• But such bigrams are not really collocations. A better
approach is to filter based on part-of-speech.
Frequency   Bigram           Tag pattern
11,487      New York         A-N
7,261       United States    A-N
5,412       Los Angeles      N-N
3,301       last year        A-N
3,191       Saudi Arabia     N-N
2,699       last week        A-N
2,514       vice president   A-N
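A small sketch of this part-of-speech filtering; it assumes a tagged token list and maps the A-N and N-N patterns above onto Penn Treebank tags (JJ, NN), which is an approximation.

from collections import Counter

LIKELY_PATTERNS = {("JJ", "NN"), ("NN", "NN")}   # adjective-noun, noun-noun

def pos_filtered_bigrams(tagged_tokens):
    """Count bigrams whose tag pattern matches a likely collocation pattern;
    tagged_tokens is a list of (word, tag) pairs."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in LIKELY_PATTERNS:
            counts[(w1, w2)] += 1
    return counts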
Mean and Variance
• Often collocations are not consecutive words, e.g. knock and door in
– She knocked on his door
– A man knocked on the metal front door
– They knocked at the door
• We need to count frequencies of co-occurrence of the
words in a window (usually 3 to 4 words to the side of
each word).
• We can also calculate the mean and the variance of the
distances between the words. Low variance means that the
words have a relatively fixed structural relationship (i.e.
they usually occur at the same distance)
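A short sketch of collecting the offsets between two words within such a window and computing their mean and variance; the window size and function name are illustrative.

from statistics import mean, pvariance

def offset_statistics(tokens, w1, w2, window=4):
    """Collect the signed distances between occurrences of w1 and w2 within
    a +/- window and return their mean and variance."""
    offsets = []
    for i, tok in enumerate(tokens):
        if tok != w1:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == w2:
                offsets.append(j - i)
    if not offsets:
        return None
    return mean(offsets), pvariance(offsets)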
Hypothesis Testing
• Based on frequency alone, the co-occurrence of two frequently occurring words might be accidental. Hypothesis testing is a method for testing whether such a co-occurrence is accidental: we formulate a null hypothesis that the co-occurrences are accidental and compute the probability p of the words occurring together if the hypothesis is true. If p is too low we reject the hypothesis (i.e. the words are related).
• Various tests have been proposed in the statistics literature for determining the probability of the words occurring together: the t test, testing of differences, Pearson’s chi-square test, the likelihood ratio.
• Mutual Information is a measure of word association based on information theory. It tells us how much our information about the occurrence of one word increases when we know about the occurrence of another word.
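A minimal sketch of the (pointwise) mutual information of a word pair, computed from unigram and bigram counts like those used earlier; high values indicate the words co-occur more often than chance.

import math

def pmi(w1, w2, unigrams, bigrams, n_tokens):
    """Pointwise mutual information: log2( P(w1 w2) / (P(w1) * P(w2)) )."""
    p_joint = bigrams[(w1, w2)] / n_tokens
    if p_joint == 0:
        return float("-inf")
    p1 = unigrams[w1] / n_tokens
    p2 = unigrams[w2] / n_tokens
    return math.log2(p_joint / (p1 * p2))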