A Stochastic Parts Program and
Noun Phrase Parser for
Unrestricted Text
by Kenneth Ward Church
HyungSuk Won
NLP Lab. CSE POSTECH
98-10-01
Introduction
Test: a 400-word sample => 99.5% correct
most errors are attributable to defects in the lexicon
remarkably few errors are related to the inadequacies of the
extremely over-simplified grammar (a trigram model)
One might have thought that n-gram models were not adequate for
the task, since it is well known that they are inadequate for
determining grammaticality
e.g. long-distance dependency
BUT for the tagging application, the n-gram approximation may be
acceptable, since long-distance dependencies do not seem to
be very important
Leech, Garside and Atwell: 96.7% on the LOB corpus, using a bigram
model modified with heuristics to cope with more important trigrams
1. How Hard Is Lexical Ambiguity?
Ordinary people who do not work in computational linguistics have a
strong intuition that lexical ambiguity is not an important problem
Conversely, experts in CL consider lexical ambiguity a major issue
Time flies like an arrow
Flying planes can be dangerous
: practically, any content word can be used as a noun, verb or adjective,
and local context is not always adequate to disambiguate
But, as Marcus says, most texts are in fact not that hard
"garden paths": The horse raced past the barn fell
After seeing sentences like these, one might think that assigning a
single part of speech to each word is a hopeless task
2. Lexical Disambiguation Rules
Fidditch’s lexical disambiguation rule
(defrule n+prep!
“>[**n+prep]!=n[npstarters]”)
; a preposition is more likely than a noun before a noun phrase
; this type can be captured with bigram and trigram statistics
(easier to obtain than Fidditch-type disambiguation rules)
If the parser does not use frequency information, then every
possibility in the dictionary must be given equal weight
=> parsing is very difficult
(ex.) the Holy See
Dictionaries tend to focus on what is possible, not on what is likely
(according to Webster’s Seventh New Collegiate Dictionary, every word
is ambiguous)
(continued)
(ex.) I see a bird
[NP [N city] [N school] [N committee] [N meeting]]
[NP [N I] [N see] [N a] [N bird]]
[S [NP [N I] [N see] [N a]] [VP [V bird]]]
3. The Proposed Method
Word   Part of speech (frequency)
I      PPSS 5837, NP 1
see    VB 771, UH 1
a      AT 23013, IN (French) 6
bird   NN 26
(PPSS: pronoun, NP: proper noun, VB: verb, UH: interjection,
IN: preposition, AT: article, NN: noun)
lexical probability:
prob(PPSS | “I”) ≈ freq(PPSS, “I”) / freq(“I”)
contextual probability:
prob(VB | AT, NN) ≈ freq(VB, AT, NN) / freq(AT, NN)
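Both estimates are simple ratios of corpus counts. A minimal sketch in Python (the word/tag counts echo the table above, while the trigram and bigram context counts are invented toy numbers, not the paper's estimates):

```python
from collections import Counter

# Frequency counts standing in for the Brown Corpus figures in the
# table above; the trigram/bigram context counts are invented.
word_tag_freq = Counter({("I", "PPSS"): 5837, ("I", "NP"): 1,
                         ("see", "VB"): 771, ("see", "UH"): 1})
word_freq = Counter({"I": 5838, "see": 772})
trigram_freq = Counter({("VB", "AT", "NN"): 100})
bigram_freq = Counter({("AT", "NN"): 120})

def lexical_prob(tag, word):
    # prob(tag | word) ~= freq(tag, word) / freq(word)
    return word_tag_freq[(word, tag)] / word_freq[word]

def contextual_prob(tag, next1, next2):
    # prob(tag | next two tags) ~= freq(tag, next1, next2) / freq(next1, next2)
    return trigram_freq[(tag, next1, next2)] / bigram_freq[(next1, next2)]

print(lexical_prob("PPSS", "I"))          # 5837 / 5838 ≈ 0.99983
print(contextual_prob("VB", "AT", "NN"))  # 100 / 120 ≈ 0.83333
```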
(continued)
A search is performed in order to find the assignment of part
of speech tags to words that optimizes the product of the
lexical and contextual probabilities
Process:
(“NN”)
(“AT” “NN”) (“IN” “NN”)
(“VB” “AT” “NN”) (“VB” “IN” “NN”) (“UH” “AT” “NN”) (“UH” “IN” “NN”)
=> PPSS VB IN NN, NP VB IN NN, PPSS UH IN NN, NP UH IN NN
; these score less well than the candidates below and can be pruned,
because the contextual scoring function has a limited window of three
parts of speech
(“PPSS” “VB” “AT” “NN”) (“NP” “VB” “AT” “NN”)
(“PPSS” “UH” “AT” “NN”) (“NP” “UH” “AT” “NN”)
=> the UH candidates are pruned for the same reason as above
(“” “PPSS” “VB” “AT” “NN”) (“” “NP” “VB” “AT” “NN”)
finally, (“” “” “PPSS” “VB” “AT” “NN”)
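The search above can be sketched as a product maximization over candidate tag sequences, padded with blank tags at the end. Exhaustive enumeration stands in here for the pruned search of the slide, and all probabilities are invented toy values, not the paper's estimates:

```python
from itertools import product

# Candidate tags per word (from the I/see/a/bird frequency counts
# above) and invented toy lexical probabilities.
words = ["I", "see", "a", "bird"]
candidates = [["PPSS", "NP"], ["VB", "UH"], ["AT", "IN"], ["NN"]]
lex = {("I", "PPSS"): 0.9998, ("I", "NP"): 0.0002,
       ("see", "VB"): 0.9987, ("see", "UH"): 0.0013,
       ("a", "AT"): 0.9997, ("a", "IN"): 0.0003,
       ("bird", "NN"): 1.0}

def ctx(tag, next1, next2):
    # prob(tag | next two tags); unseen triples back off to a small
    # constant so every path keeps a nonzero score (toy values).
    table = {("PPSS", "VB", "AT"): 0.2, ("VB", "AT", "NN"): 0.3,
             ("AT", "NN", ""): 0.4, ("NN", "", ""): 0.5}
    return table.get((tag, next1, next2), 0.0001)

def score(tags):
    # Product of lexical and contextual probabilities, with the
    # sequence padded by two blank tags at the end.
    padded = list(tags) + ["", ""]
    s = 1.0
    for w, t, n1, n2 in zip(words, tags, padded[1:], padded[2:]):
        s *= lex[(w, t)] * ctx(t, n1, n2)
    return s

best = max(product(*candidates), key=score)
print(best)  # ('PPSS', 'VB', 'AT', 'NN')
```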
4. Parsing simple non-recursive noun
phrases stochastically
similar stochastic methods are applied to locate simple noun
phrases with very high accuracy
stochastic parser
input: a sequence of parts of speech
processing : insert brackets corresponding to the beginning and end
of noun phrases
output : [A/AT former/AP top/NN aide/NN] to/IN [Attorney/NP ….]…
(ex.) NN VB
NN VB, [NN] VB, [NN VB], [NN] [VB], NN [VB]
(continued)
Probability of starting a noun phrase (between the row tag and the column tag)

       AT     NN     NNS    VB     IN
AT     0      0      0      0      0
NN     0.99   0.01   0      0      0
NNS    1      0.02   0.11   0      0
VB     1      1      1      0      0
IN     1      1      1      0      0

Probability of ending a noun phrase (between the row tag and the column tag)

       AT     NN     NNS    VB     IN
AT     0      0      0      0      0
NN     1      0.01   0      0      1
NNS    1      0.02   0.11   1      1
VB     0      0      0      0      0
IN     0      0      0      0      0.02
AT (article), NN (singular noun), NNS (plural noun),
VB (uninflected verb), IN (preposition)
these probabilities were estimated from about 40,000 words (11,000
noun phrases) of training material selected from the Brown Corpus
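One way to sketch the bracketing step: score every in/out labelling of the tag sequence by multiplying, at each word boundary, the probability of starting (or not starting) and ending (or not ending) a noun phrase there. The table values are transcribed from this slide, but the in/out representation (which merges adjacent noun phrases into one run) and the free treatment of sentence edges are simplifications of my own, not the paper's exact procedure:

```python
from itertools import product

# Start/end probabilities between a left tag (row) and a right tag
# (column), transcribed from the two tables above.
START = {
    "AT":  {"AT": 0,    "NN": 0,    "NNS": 0,    "VB": 0, "IN": 0},
    "NN":  {"AT": 0.99, "NN": 0.01, "NNS": 0,    "VB": 0, "IN": 0},
    "NNS": {"AT": 1,    "NN": 0.02, "NNS": 0.11, "VB": 0, "IN": 0},
    "VB":  {"AT": 1,    "NN": 1,    "NNS": 1,    "VB": 0, "IN": 0},
    "IN":  {"AT": 1,    "NN": 1,    "NNS": 1,    "VB": 0, "IN": 0},
}
END = {
    "AT":  {"AT": 0, "NN": 0,    "NNS": 0,    "VB": 0, "IN": 0},
    "NN":  {"AT": 1, "NN": 0.01, "NNS": 0,    "VB": 0, "IN": 1},
    "NNS": {"AT": 1, "NN": 0.02, "NNS": 0.11, "VB": 1, "IN": 1},
    "VB":  {"AT": 0, "NN": 0,    "NNS": 0,    "VB": 0, "IN": 0},
    "IN":  {"AT": 0, "NN": 0,    "NNS": 0,    "VB": 0, "IN": 0.02},
}

def score(tags, in_np):
    # Multiply start/end probabilities (or their complements) over
    # each boundary between adjacent words.
    s = 1.0
    for (t1, f1), (t2, f2) in zip(zip(tags, in_np), zip(tags[1:], in_np[1:])):
        opens = (not f1) and f2     # a noun phrase starts here
        closes = f1 and (not f2)    # a noun phrase ends here
        s *= START[t1][t2] if opens else 1 - START[t1][t2]
        s *= END[t1][t2] if closes else 1 - END[t1][t2]
    return s

tags = ["AT", "NN", "IN", "AT", "NN"]   # e.g. "a bird in a tree"
best = max(product([False, True], repeat=len(tags)),
           key=lambda flags: score(tags, flags))

out = []
for i, (t, f) in enumerate(zip(tags, best)):
    if f and not (i > 0 and best[i - 1]):
        out.append("[")
    out.append(t)
    if f and not (i + 1 < len(tags) and best[i + 1]):
        out.append("]")
print(" ".join(out))  # [ AT NN ] IN [ AT NN ]
```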
5. Smoothing Issues
Zipf’s Law: frequency ∝ 1 / rank
To alleviate this:
words that do not appear in the training corpus: use a conventional
dictionary => add 1 to the frequency count of each possibility listed
in the dictionary
proper nouns and capitalized words
=> capitalized words with small frequency counts (< 20) were
thrown out of the lexicon
(ex) Act/NP
1. add 1 for the proper noun possibility
(ex.) fall ( (1 “JJ”) (65 “VB”) (72 “NN”) )
Fall ( (1 “NP”) (1 “JJ”) (65 “VB”) (72 “NN”) )
2. prepass: label words as proper nouns
if they are “adjacent to” other capitalized words
(ex) White House, States of the Union
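Adjustment 1 above can be sketched as follows. `tag_freqs` is a hypothetical helper name, and the <20 frequency pruning and the capitalization prepass are not shown:

```python
def tag_freqs(lexicon, word):
    # Look the word up in its lowercase form; if the surface form is
    # capitalized, add 1 to the proper-noun (NP) count so that e.g.
    # "Fall" keeps NP as a live possibility.
    freqs = dict(lexicon.get(word.lower(), {}))
    if word[:1].isupper():
        freqs["NP"] = freqs.get("NP", 0) + 1
    return freqs

lexicon = {"fall": {"JJ": 1, "VB": 65, "NN": 72}}
print(tag_freqs(lexicon, "fall"))  # {'JJ': 1, 'VB': 65, 'NN': 72}
print(tag_freqs(lexicon, "Fall"))  # {'JJ': 1, 'VB': 65, 'NN': 72, 'NP': 1}
```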