A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text
by Kenneth Ward Church

presented by HyungSuk Won
NLP Lab., CSE, POSTECH
1998-10-01

Introduction
- Test: a 400-word sample => 99.5% of words tagged correctly
  - Most errors are attributable to defects in the lexicon.
  - Remarkably few errors are related to the inadequacies of the extremely
    over-simplified grammar (a trigram model).
- One might think that n-gram models are not adequate for the task, since it
  is well known that they are inadequate for determining grammaticality,
  e.g., long-distance dependencies.
- But for the tagging application, the n-gram approximation may be acceptable,
  since long-distance dependencies do not seem to be very important.
  - Leech, Garside and Atwell: 96.7% on the LOB corpus, using a bigram model
    modified with heuristics to cope with more important trigrams.

1. How Hard is Lexical Ambiguity?
- People who do not work in computational linguistics have a strong intuition
  that lexical ambiguity is not an important problem.
- Conversely, CL experts consider lexical ambiguity a major issue.
  - Time flies like an arrow
  - Flying planes can be dangerous
  : practically any content word can be used as a noun, verb, or adjective,
  and local context is not always adequate to disambiguate.
- However, as Marcus observes, most texts are not actually that hard.
  - "garden paths": The horse raced past the barn fell
- After seeing sentences like these, one might conclude that assigning a
  single part of speech to each word is hopeless.

2. Lexical Disambiguation Rules
- Fidditch's lexical disambiguation rules, e.g.:

  (defrule n+prep!
    ">[**n+prep]!=n[npstarters]")
  ; a preposition is more likely than a noun before a noun phrase

- This type of preference can be captured with bigram and trigram statistics,
  which are easier to obtain than Fidditch-type disambiguation rules.
- If the parser does not use frequency information, then every possibility in
  the dictionary must be given equal weight => parsing becomes very difficult.
  (ex.) the Holy See
- Dictionaries tend to focus on what is possible, not on what is likely
  (according to Webster's Seventh New Collegiate Dictionary, every word is
  ambiguous).

(continued)
- (ex.) I see a bird
  - since compound nouns like [NP [N city] [N school] [N committee] [N meeting]]
    are possible, an equal-weight parser must also entertain analyses such as:
    [NP [N I] [N see] [N a] [N bird]]
    [S [NP [N I] [N see] [N a]] [VP [V bird]]]
3. The proposed methods
  Word  | Part of speech (frequency)
  ------+----------------------------
  I     | PPSS 5837, NP 1
  see   | VB 771, UH 1
  a     | AT 23013, IN (French) 6
  bird  | NN 26

  (PPSS: pronoun, NP: proper noun, VB: verb, UH: interjection,
   IN: preposition, AT: article, NN: noun)

- lexical probability:
  prob(PPSS | "I") ≈ freq("I", PPSS) / freq("I")
- contextual probability:
  prob(VB | AT, NN) ≈ freq(VB, AT, NN) / freq(AT, NN)
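
As a rough sketch (not Church's code), both estimates can be computed from a
tagged corpus with a few counters. The toy corpus and the helper names
lexical_prob and contextual_prob below are invented for illustration:

    from collections import Counter

    # Toy tagged corpus as (word, tag) pairs -- a stand-in for the Brown Corpus.
    tagged = [("I", "PPSS"), ("see", "VB"), ("a", "AT"), ("bird", "NN"),
              ("I", "PPSS"), ("see", "VB"), ("a", "AT"), ("plane", "NN")]

    word_tag = Counter(tagged)                        # freq(word, tag)
    word = Counter(w for w, _ in tagged)              # freq(word)
    tags = [t for _, t in tagged]
    trigram = Counter(zip(tags, tags[1:], tags[2:]))  # freq(tag, next1, next2)
    bigram = Counter(zip(tags, tags[1:]))             # freq(next1, next2)

    def lexical_prob(tag, w):
        """prob(tag | word) ~ freq(word, tag) / freq(word)"""
        return word_tag[(w, tag)] / word[w]

    def contextual_prob(tag, next1, next2):
        """prob(tag | next two tags) ~ freq(tag, next1, next2) / freq(next1, next2)"""
        return trigram[(tag, next1, next2)] / bigram[(next1, next2)]

    print(lexical_prob("PPSS", "I"))          # 1.0 on the toy corpus
    print(contextual_prob("VB", "AT", "NN"))  # 1.0 on the toy corpus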

(continued)
- A search is performed to find the assignment of part-of-speech tags to words
  that optimizes the product of the lexical and contextual probabilities.
- Procedure, working right to left through "I see a bird" (a sketch of the
  dynamic-programming search follows this list):
  - ("NN")
  - ("AT" "NN") ("IN" "NN")
  - ("VB" "AT" "NN") ("VB" "IN" "NN") ("UH" "AT" "NN") ("UH" "IN" "NN")
    => paths such as PPSS VB IN NN, NP VB IN NN, PPSS UH IN NN, NP UH IN NN
    score less well than those below; because the contextual scoring function
    has a limited window of three parts of speech, such paths can be pruned.
  - ("PPSS" "VB" "AT" "NN") ("NP" "VB" "AT" "NN")
    ("PPSS" "UH" "AT" "NN") ("NP" "UH" "AT" "NN")
    => pruned for the same reason as above
  - ("" "PPSS" "VB" "AT" "NN") ("" "NP" "VB" "AT" "NN")
  - finally, ("" "" "PPSS" "VB" "AT" "NN")
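
A minimal sketch of this right-to-left search follows. The LEX table and
contextual_prob values are invented toy numbers, not Brown Corpus estimates;
only the pruning logic mirrors the slide:

    import math

    # Invented toy lexical probabilities, prob(tag | word).
    LEX = {"I": {"PPSS": 0.999, "NP": 0.001},
           "see": {"VB": 0.999, "UH": 0.001},
           "a": {"AT": 0.999, "IN": 0.001},
           "bird": {"NN": 1.0}}

    def contextual_prob(tag, next1, next2):
        # Invented stand-in for prob(tag | next two tags).
        good = {("PPSS", "VB", "AT"), ("VB", "AT", "NN")}
        return 0.9 if (tag, next1, next2) in good else 0.01

    def tag_sentence(words):
        # Work right to left; for each (tag, following tag) pair keep only the
        # best-scoring suffix. This is why the three-tag window lets poorer
        # paths be discarded at every step.
        best = {(): (0.0, [])}  # next-two-tags context -> (log score, suffix)
        for w in reversed(words):
            new_best = {}
            for tag, lex in LEX[w].items():
                for ctx, (score, suffix) in best.items():
                    s = score + math.log(lex)
                    if len(ctx) == 2:
                        s += math.log(contextual_prob(tag, ctx[0], ctx[1]))
                    key = ((tag,) + ctx)[:2]
                    if key not in new_best or s > new_best[key][0]:
                        new_best[key] = (s, [tag] + suffix)
            best = new_best
        return max(best.values(), key=lambda v: v[0])[1]

    print(tag_sentence(["I", "see", "a", "bird"]))  # ['PPSS', 'VB', 'AT', 'NN']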

4. Parsing simple non-recursive noun phrases stochastically
- Similar stochastic methods can be applied to locate simple noun phrases with
  very high accuracy.
- The stochastic parser:
  - input: a sequence of parts of speech
  - processing: insert brackets corresponding to the beginning and end of noun
    phrases
  - output: [A/AT former/AP top/NN aide/NN] to/IN [Attorney/NP ...] ...
- (ex.) the input NN VB has five possible bracketings (enumerated by the
  sketch below):
  NN VB, [NN] VB, [NN VB], [NN] [VB], NN [VB]
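
The candidates can be enumerated mechanically; bracketings below is an
invented helper for illustration, not the paper's code:

    def bracketings(tags):
        """Yield every way to bracket a tag sequence with non-overlapping,
        non-recursive noun phrases."""
        if not tags:
            yield []
            return
        # Case 1: the first tag lies outside any noun phrase.
        for rest in bracketings(tags[1:]):
            yield [tags[0]] + rest
        # Case 2: a noun phrase covers tags[0..i-1].
        for i in range(1, len(tags) + 1):
            np = "[" + " ".join(tags[:i]) + "]"
            for rest in bracketings(tags[i:]):
                yield [np] + rest

    for b in bracketings(["NN", "VB"]):
        print(" ".join(b))
    # NN VB / NN [VB] / [NN] VB / [NN] [VB] / [NN VB]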

(continued)

Probability of starting a noun phrase between two parts of speech
(row = first tag, column = second tag):

         AT     NN     NNS    VB     IN
  AT     0      0      0      0      0
  NN     0.99   0.01   0      0      0
  NNS    1.0    0.02   0.11   0      0
  VB     1.0    1.0    1.0    0      0
  IN     1.0    1.0    1.0    0      0

Probability of ending a noun phrase between two parts of speech
(row = first tag, column = second tag):

         AT     NN     NNS    VB     IN
  AT     0      0      0      0      0
  NN     1.0    0.01   0      0      1.0
  NNS    1.0    0.02   0.11   1.0    1.0
  VB     0      0      0      0      0
  IN     0      0      0      0      0.02

- AT (article), NN (singular noun), NNS (non-singular noun), VB (uninflected
  verb), IN (preposition)
- These probabilities were estimated from about 40,000 words (11,000 noun
  phrases) of training material selected from the Brown Corpus (a sketch of
  scoring a bracketing with these tables follows).
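
A minimal sketch of how such tables could score one candidate bracketing.
The dictionary excerpts and the multiplicative scoring rule below are a
simplified stand-in for the paper's actual search over all bracketings:

    # Excerpts of the two tables, keyed by (first tag, second tag).
    START = {("IN", "AT"): 1.0, ("AT", "NN"): 0.01, ("NN", "IN"): 0.0}
    END = {("IN", "AT"): 0.0, ("AT", "NN"): 0.0, ("NN", "IN"): 1.0}

    def score(tags, starts, ends):
        """Score a bracketing: at each gap i (between tags[i-1] and tags[i]),
        multiply in the probability of the action taken there -- opening a
        bracket, closing one, or doing neither."""
        p = 1.0
        for i in range(1, len(tags)):
            ps = START.get((tags[i - 1], tags[i]), 0.0)
            pe = END.get((tags[i - 1], tags[i]), 0.0)
            p *= ps if i in starts else (1 - ps)
            p *= pe if i in ends else (1 - pe)
        return p

    # "[aide/NN] to/IN [the/AT attorney/NN]": close an NP at gap 1, open at gap 2.
    print(score(["NN", "IN", "AT", "NN"], starts={2}, ends={1}))  # 0.99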

5. Smoothing Issues
- Zipf's Law: frequency ∝ 1 / rank
  (most words are rare, so many lexical possibilities are never seen in
  training and their counts must be smoothed)
- Words that never appear in training: use a conventional dictionary => add 1
  to the frequency count of each possibility listed in the dictionary.
- Proper nouns and capitalized words:
  => capitalized words with small frequency counts (<20) were thrown out of
  the lexicon
  (ex.) Act/NP
  1. add 1 for the proper-noun possibility
     (ex.) fall ( (1 "JJ") (65 "VB") (72 "NN") )
           Fall ( (1 "NP") (1 "JJ") (65 "VB") (72 "NN") )
  2. a prepass labels words as proper nouns if they are "adjacent to" other
     capitalized words (a sketch of both heuristics follows)
     (ex.) White House, States of the Union
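
A sketch of the two heuristics, assuming a lexicon mapping each word to its
(count, tag) possibilities; the helper names are invented for illustration:

    # Lexicon entries as (count, tag) pairs, as in the "fall" example above.
    lexicon = {"fall": [(1, "JJ"), (65, "VB"), (72, "NN")]}

    def possibilities(word):
        """Heuristic 1: a capitalized word inherits the possibilities of its
        lowercase form, plus one count for the proper-noun reading."""
        if word[0].isupper():
            return [(1, "NP")] + lexicon.get(word.lower(), [])
        return lexicon.get(word, [])

    def proper_noun_prepass(words):
        """Heuristic 2: label a capitalized word NP outright when it is
        adjacent to another capitalized word (e.g. White House)."""
        caps = [w[0].isupper() for w in words]
        return ["NP" if caps[i] and ((i > 0 and caps[i - 1]) or
                                     (i + 1 < len(caps) and caps[i + 1]))
                else None
                for i in range(len(words))]

    print(possibilities("Fall"))
    # [(1, 'NP'), (1, 'JJ'), (65, 'VB'), (72, 'NN')]
    print(proper_noun_prepass(["the", "White", "House", "said"]))
    # [None, 'NP', 'NP', None]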