Statistical Methods in Computational Linguistics


A BAYESIAN APPROACH TO SPELLING CORRECTION
'Noisy channels'

In a number of tasks involving natural language, the problem can be viewed as recovering an 'original signal' distorted by a 'noisy channel':
– Speech recognition
– Spelling correction
– OCR / handwriting recognition
– (less felicitously, perhaps) pronunciation variation

This metaphor has provided the justification for the Bayesian approach to statistical NLP, which has also found application outside these areas.
Spelling Errors

– They are leaving in about fifteen minuets to go to her house.
– The study was conducted mainly be John Black.
– The design an construction of the system will take more than one year.
– Hopefully, all with continue smoothly in my absence.
– Can they lave him my messages?
– I need to notified the bank of this problem.
– He is trying to fine out.
Handwriting recognition

From Woody Allen's Take the Money and Run (1969):
– Allen (a bank robber) walks up to the teller and hands him a note that reads: "I have a gun. Give me all your cash."
The teller, however, is puzzled, because he reads "I have a gub." "No, it's 'gun'," Allen says.
"Looks like 'gub' to me," the teller says, then asks another teller to help him read the note, then another, and finally everyone is arguing over what the note means.
Spelling errors

How common are spelling errors?
– .005% in carefully edited newswire
– 1-3% in 'normal' human-written text
– 20% of web queries are misspelled (Google includes spelling correction algorithms)
– 38% in applications like directory lookup
– Handwriting recognition errors: Apple Newton: 2-3%
Types of spelling errors

Damerau (1964): 80% of all misspelled words (nonword errors) caused by SINGLE-ERROR
MISSPELLINGS:
–
–
–
–
INSERTION: the  ther
DELETION: the  th
SUBSTITUTION: the  thw
TRANSPOSITION: the  hte
Dealing with spelling errors
(Kukich, 1992)

3 increasingly broader problems:
– NON-WORD ERROR DETECTION: 'graffe' instead of 'giraffe'
– ISOLATED WORD-ERROR CORRECTION: replacing 'graffe' with 'giraffe' without looking at context
– CONTEXT-DEPENDENT ERROR DETECTION / CORRECTION: also detecting spelling errors that result in a real word
Detecting non-word errors:
Dictionaries

Peterson, 1986: large dictionaries may do more damage than good, because rare words can mask misspellings of common ones:
– wont (masks misspellings of 'want' / 'won't')
– veery (masks a misspelling of 'very')
Damerau and Mays (1989): no evidence this was the case.
The Noisy Channel model
Bayesian inference

'Bayesian inference' is the name given to techniques typically used in diagnostics to identify the CAUSE of certain OBSERVATIONS.
The name 'Bayesian' comes from the fact that Bayes' rule is used to 'turn around' the problem: from estimating the posterior probability of the CAUSE given the observations to estimating the probability of the OBSERVATIONS given the cause (the likelihood), combined with the prior probability of the cause.
Bayesian inference: the equations

(These are equations that we will encounter again and again for different tasks.)
The statistical formulation of the problem of finding the most likely 'explanation' for the observation:

$$\hat{w} = \arg\max_{w \in V} P(w \mid O)$$

Using Bayes' rule, this probability can be 'turned around':

$$\hat{w} = \arg\max_{w \in V} \frac{P(O \mid w)\, P(w)}{P(O)}$$
Bayesian equations, 2

Some of these quantities are easy to compute, but others much less so – especially P(O).
Fortunately, we don't really need to compute this term! (It is the same for ALL 'explanations'.)

$$\hat{w} = \arg\max_{w \in V} \underbrace{P(O \mid w)}_{\text{likelihood}}\;\underbrace{P(w)}_{\text{prior}}$$

This equation is a pattern that we will encounter again and again.
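This pattern is easy to state in code. Below is a minimal, generic sketch of the noisy-channel argmax; the `prior` and `likelihood` arguments are hypothetical placeholders for whatever models a particular task supplies, not part of any specific system described here.

```python
# Generic noisy-channel decoding sketch: pick the explanation w that
# maximizes P(O|w) * P(w). P(O) is omitted because it is the same for
# every candidate and therefore does not affect the argmax.

def noisy_channel_decode(observation, candidates, prior, likelihood):
    return max(candidates, key=lambda w: likelihood(observation, w) * prior(w))
```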
Applying the Bayesian Method to Spelling: Kernighan et al., 1990

correct takes words rejected by spell and generates a list of potential correct words.
Two steps:
1. Proposing candidate corrections
2. Scoring the candidates
An example of isolated word-error correction.
Proposing candidate corrections

The noisy channel assumption: the misspelled word is the result of a 'noisy channel' – the typist performing a single MISTYPING OPERATION.
Four possible operations:
– INSERTION: x → xy
– DELETION: xy → x
– SUBSTITUTION: y → x
– REVERSAL: xy → yx
At most one operation involved (cf. Damerau, 1964).
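Below is a minimal Python sketch (in the spirit of Kernighan et al., but not their implementation) of generating every candidate exactly one edit away from a typo and keeping only those found in a dictionary; the dictionary is assumed to be supplied as a set of words.

```python
import string

# Propose every string one insertion, deletion, substitution, or reversal
# away from the typo (edits applied to the typo string itself), keeping only
# candidates that appear in the supplied dictionary (a set of words).

def candidate_corrections(typo, dictionary):
    letters = string.ascii_lowercase
    splits = [(typo[:i], typo[i:]) for i in range(len(typo) + 1)]
    deletions = [a + b[1:] for a, b in splits if b]
    insertions = [a + ch + b for a, b in splits for ch in letters]
    substitutions = [a + ch + b[1:] for a, b in splits if b for ch in letters]
    reversals = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    candidates = deletions + insertions + substitutions + reversals
    return {w for w in candidates if w in dictionary}

# Example: candidate_corrections("acress", {"actress", "cress", "caress",
# "access", "across", "acres"}) returns all six candidates shown in the
# example that follows.
```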
Example: acress

Error  | Correction | Correct letter | Error letter | Position | Type
-------|------------|----------------|--------------|----------|-------
acress | actress    | t              | -            | 2        | del
acress | cress      | -              | a            | 0        | ins
acress | caress     | ca             | ac           | 0        | transp
acress | access     | c              | r            | 2        | sub
acress | across     | o              | e            | 3        | sub
acress | acres      | -              | s            | 5        | ins
Scoring the candidates

Choose the correction with the highest probability:

$$\hat{c} = \arg\max_{c \in C} \underbrace{P(t \mid c)}_{\text{likelihood}}\;\underbrace{P(c)}_{\text{prior}}$$

P(c): MLE estimation in a 44M-word corpus, with smoothing (Good-Turing); a simple sketch of the prior follows after the table.

c       | freq(c) | P(c)
--------|---------|------------
actress | 1343    | .0000315
cress   | 0       | .000000014
caress  | 4       | .0000001
access  | 2280    | .000058
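As an illustration of the prior, here is a minimal sketch estimating P(c) from corpus counts. For simplicity it uses a small additive floor for unseen words rather than the Good-Turing smoothing used by Kernighan et al.; the corpus format and the floor value are assumptions for the example.

```python
from collections import Counter

# Sketch of the prior P(c): relative frequency in a corpus, with a small
# additive floor so that unseen candidates (e.g. "cress", frequency 0) still
# receive non-zero probability. Kernighan et al. used Good-Turing smoothing
# over a 44M-word corpus; the add-0.5 floor here is only an illustrative
# stand-in for that.

def make_prior(corpus_tokens, floor=0.5):
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab_size = len(counts)

    def prior(word):
        return (counts[word] + floor) / (total + floor * (vocab_size + 1))

    return prior
```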
A simplification

THE TRAINING CORPUS:
acress actress actress
acress acres acres acres

WOULD WANT likelihoods estimated directly from typo/correction pairs:
P(acress|actress) = 1/3
P(acress|acres) = 1/4

Such pair data is too sparse, so APPROXIMATE WITH character-level confusion counts:
P(acress|actress) ≈ del[c,t] / count[ct] = 1/3 (?)
Confusion matrices

Difficult to compute directly, but can be estimated by looking at LOCAL FACTORS only.
Entry [m,n] in a CONFUSION MATRIX for SUBSTITUTION tells us how often n is used instead of m.
Kernighan et al. used four confusion matrices:
– del[x,y] (number of times x is typed instead of correct xy)
– ins[x,y] (number of times xy is typed instead of correct x)
– sub[x,y] (number of times y is typed instead of correct x)
– trans[x,y] (number of times yx is typed instead of correct xy)
Estimating the likelihood of a typo

$$P(t \mid c) =
\begin{cases}
\dfrac{\mathrm{del}[c_{p-1}, c_p]}{\mathrm{count}[c_{p-1} c_p]} & \text{if deletion} \\[2ex]
\dfrac{\mathrm{ins}[c_{p-1}, t_p]}{\mathrm{count}[c_{p-1}]} & \text{if insertion} \\[2ex]
\dfrac{\mathrm{sub}[t_p, c_p]}{\mathrm{count}[c_p]} & \text{if substitution} \\[2ex]
\dfrac{\mathrm{trans}[c_p, c_{p+1}]}{\mathrm{count}[c_p c_{p+1}]} & \text{if transposition}
\end{cases}$$
Resulting likelihoods

c       | freq(c) | P(c)       | P(t|c)     | P(t|c)P(c) | %
--------|---------|------------|------------|------------|-----
actress | 1343    | .0000315   | .000117    | 3.69×10⁻⁹  | 37%
cress   | 0       | .000000014 | .00000144  | 2.02×10⁻¹² | 0%
caress  | 4       | .0000001   | .00000164  | 1.64×10⁻¹³ | 0%
access  | 2280    | .000058    | .000000209 | 1.21×10⁻¹¹ | 0%
across  | 8436    | .00019     | .0000093   | 1.77×10⁻⁹  | 18%
acres   | 2879    | .000065    | .0000321   | 2.09×10⁻⁹  | 21%
acres   | 2879    | .000065    | .0000342   | 2.09×10⁻⁹  | 23%

(acres appears twice because 'acress' can be produced from it by inserting an s at two different positions, each with its own channel probability.)
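For completeness, here is a minimal sketch of the channel model P(t|c) defined by the case formula above (the quantity in the P(t|c) column). The confusion matrices and character-sequence counts are assumed to be plain dictionaries, and the edit is assumed to be already identified; this data layout is an illustration, not Kernighan et al.'s code.

```python
# Channel model P(t|c) following the four-way case formula: `matrices` maps
# "del"/"ins"/"sub"/"trans" to dicts of edit counts, and `char_counts` maps
# single characters and character bigrams to their corpus frequencies.

def channel_probability(edit_type, x, y, matrices, char_counts):
    if edit_type == "del":    # correct xy typed as x
        return matrices["del"][(x, y)] / char_counts[x + y]
    if edit_type == "ins":    # correct x typed as xy
        return matrices["ins"][(x, y)] / char_counts[x]
    if edit_type == "sub":    # y intended, x typed: sub[t_p, c_p] / count[c_p]
        return matrices["sub"][(x, y)] / char_counts[y]
    if edit_type == "trans":  # correct xy typed as yx
        return matrices["trans"][(x, y)] / char_counts[x + y]
    raise ValueError(f"unknown edit type: {edit_type}")
```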
Evaluation (3 judges, 329 triples)

Method     | Results | Perc. correct
-----------|---------|---------------
correct    | 286/329 | 87 ± 1.9%
no prior   | 263/329 | 80 ± 2.2%
no channel | 247/329 | 75 ± 2.4%
neither    | 172/329 | 52 ± 2.8%
Judge 1    | 271/273 | 99 ± 0.5%
Judge 2    | 271/275 | 99 ± 0.7%
Judge 3    | 271/281 | 96 ± 1.1%
More sophisticated methods

– MINIMUM EDIT DISTANCE: allow for the possibility of more than one error (see the sketch below)
– N-GRAM models: use context (detect 'real-word' errors)
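As an illustration of the first idea, here is the textbook dynamic-programming computation of minimum edit distance with unit costs for insertion, deletion, and substitution (a generic sketch, not any particular paper's implementation).

```python
# Minimum edit (Levenshtein) distance: the smallest number of insertions,
# deletions, and substitutions needed to turn `source` into `target`.

def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,             # deletion
                             dist[i][j - 1] + 1,             # insertion
                             dist[i - 1][j - 1] + sub_cost)  # substitution / match
    return dist[n][m]

# Example: min_edit_distance("graffe", "giraffe") == 1
```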
References

– Jurafsky and Martin, chapter 5.
– Kernighan, M. D., Church, K. W., and Gale, W. A. (1990). A spelling correction method based on a noisy channel model. COLING-90, 205-211.
– Kukich, K. (1992). Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4), 377-439.
– More recent work: Brill, E. and Moore, R. (2000). An improved error model for noisy channel spelling correction. Proc. ACL 2000.