Lexical Acquisition
Extending our information about
words, particularly quantitative
information
Why lexical acquisition?
• “one cannot learn a new language by reading
a bilingual dictionary” -- Mercer
– Parsing ‘postmen’ requires context
• quantitative information is difficult to
collect by hand
– e.g., priors on word senses
• productivity of language
– Lexicons need to be updated for new
words and usages
Machine-readable Lexicons contain...
• Lexical vs syntactic information
√ Word senses
– Classifications, subclassifications
√ Collocations
– Arguments, preferences
– Synonyms, antonyms
– Quantitative information
Gray area between lexical and syntactic
• The rules of grammar are syntactic.
– S ::= NP V NP
– S ::= NP [V NP PP]
• But which one to use, when?
– The children ate the cake with their
hands.
– The children ate the cake with blue
icing.
Outline of chapter
• verb subcategorization
– Which arguments (e.g. infinitive, DO)
does a particular verb admit?
• attachment ambiguity
– What does the modifier refer to?
• selectional preferences
– Does a verb tend to restrict its object
to a certain class?
• semantic similarity between words
– This new word is most like which words?
Verb subcategorization frames
• Assign to each verb the SFs legal for it.
(see diagram)
• Crucial for parsing.
– She told the man where Peter grew up.
• (NP NP S)
– She found the place where Peter grew
up.
• (NP NP)
Brent’s method (1993)
• Learn subcategorizations given a corpus,
lexical analyzer, and cues.
• A cue is a pair <L,SF>:
– L is a star-free regular expression over
lexemes
• (OBJ | SUBJ-OBJ | CAP) (PUNC | CC)
– SF is a subcategorization frame
• NP NP
• Strategy: find verb sf’s for which the
cues provide strong evidence.
Brent’s method (cont’d)
• Compute the error rate of the cue E =
Pr(false positives)
• For each verb v and cue c = <L,SF>,
• Test the hypothesis H0 that verb v does
not admit SF.
– pE = Σ(k=m..n) C(n,k) E^k (1-E)^(n-k)
– n = # occurrences of v; m = # of those
occurrences matching the cue’s pattern L
• If pE < a threshold, reject H0.
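The binomial tail test above can be sketched in a few lines of Python (function and parameter names are mine, not Brent's):

```python
from math import comb

def brent_p_value(n, m, eps):
    """p_E for H0: the verb does not admit frame SF.

    n   -- total occurrences of the verb in the corpus
    m   -- occurrences where the cue's pattern L fired
    eps -- the cue's error rate E = Pr(false positive)

    Returns the probability of seeing at least m false cue hits
    in n occurrences if the verb truly does not admit the frame.
    """
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k)
               for k in range(m, n + 1))
```

If `brent_p_value(n, m, eps)` falls below the chosen threshold, H0 is rejected and the frame is added to the verb's lexicon entry.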
Subcategorization Frames: Ideas
• Hypothesis testing gives high precision, low
recall.
• Unreliable cues are necessary and helpful
(independence assumption)
• Find SF’s for verb classes, rather than
verbs, using a buggy tagger.
• As long as error estimates are
incorporated into pE, it works great.
• Manning did this, and improved recall.
Attachment Ambiguity: PPs
• NP V NP PP -- Does the PP modify V or NP?
• Assumption: there is only one meaningful
parse for each sentence:
x The children ate the cake with a spoon.
√ Bush sent 100,000 soldiers into Kuwait.
√ Brazil honored their deal with the IMF.
• Straw man: compare co-occurrence counts
between pairs <send, into> and <soldiers,
into>.
Bias defeats simple counting
• Prob(into | send) > Prob(into | soldiers).
• Sometimes there will be strong association
between PP and both V and NP.
– Ford ended its venture with Fiat.
• In this case, there is a bias toward “low
attachment” -- attaching PP to the nearer
referent, NP.
Hindle and Rooth (1993)
• Elegant (?) method of quantifying the low
attachment bias
• Express P(first PP after the object
attaches to the noun) and P(first PP after
the object attaches to the verb) in terms of
– P(NA) = P(some PP following the object
attaches to the noun)
– P(VA) = P(some PP following the object
attaches to the verb)
• Estimate P(NA) and P(VA) by counting
Estimating P(NA) and P(VA)
• <v,n,p> are a particular verb, noun, and
preposition
• P(VAp | v) =
– (# times p attaches to v)/(# occs of v)
• P(NAp | n) =
– (# times p attaches to n)/(# occs of n)
• The two are treated as independent!
Attachment of first PP
• P(Attach(p,n) | v,n) = P(NAp | n)
– Whenever there is a PP attaching to the
noun, the first such PP attaches to the
noun!
• P(Attach(p,v) | v,n) = P((not NAp) | n) P(VAp | v)
– Whenever there is no PP attaching to
the noun, AND a PP attaching to verb…
– I (put the [book on the table) on WW2]
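The two formulas above reduce to a simple decision rule. A minimal sketch, using maximum-likelihood count ratios for P(VAp | v) and P(NAp | n) and the slides' independence assumption (names are mine):

```python
def first_pp_attachment(c_vp, c_v, c_np, c_n):
    """Decide where the first PP (headed by preposition p) after the
    object attaches.

    c_vp / c_v estimates P(VAp | v): fraction of occurrences of verb v
    with a p-PP attached to it; c_np / c_n likewise gives P(NAp | n).

    Per the slides:
      P(Attach(p,n) | v,n) = P(NAp | n)
      P(Attach(p,v) | v,n) = (1 - P(NAp | n)) * P(VAp | v)
    """
    p_va = c_vp / c_v
    p_na = c_np / c_n
    p_noun = p_na
    p_verb = (1 - p_na) * p_va
    return ("noun" if p_noun >= p_verb else "verb", p_noun, p_verb)
```

For example, a verb that takes the preposition often (30/100) against a noun that rarely does (10/100) yields verb attachment, since 0.9 * 0.3 > 0.1.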
Selectional Preferences
• Verbs prefer classes of subjects, objects:
– Objects of ‘eat’ tend to be food items
– Subjects of ‘think’ tend to be people
– Subjects of ‘bark’ tend to be dogs
• Used to
– disambiguate word sense
– infer class of new words
– rank multiple parses
Disambiguate the class (Resnik)
– She interrupted the chair.
• A(nc) = P(nc|v) log(P(nc|v) / P(nc))
• This is the nc term of the relative
entropy (Kullback-Leibler divergence)
D(P(· | v) || P(·))
• A(furniture) = P(furniture | interrupted) *
log(P(furniture | interrupted) / P(furniture))
Estimating P(nc | v)
• P(nc | v) = P(nc,v) / P(v)
• P(v) is estimated as the proportion of
occurrences of v among all verb occurrences
• P(nc,v) is proposed to be
– 1/N Σ(n in nc) C(v,n)/|classes(n)|
• Now just take the class with highest A(nc)
for maximum likelihood word sense.
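The association score and the argmax over classes can be sketched directly (a toy illustration; the distributions here are assumptions, not estimates from a corpus):

```python
from math import log2

def association(p_nc_given_v, p_nc):
    """A(nc): the nc term of D(P(nc|v) || P(nc))."""
    return p_nc_given_v * log2(p_nc_given_v / p_nc)

def most_likely_class(p_given_v, prior):
    """Pick the noun class nc with the highest A(nc).

    p_given_v -- dict mapping class -> P(nc | v)
    prior     -- dict mapping class -> P(nc)
    """
    return max(p_given_v,
               key=lambda nc: association(p_given_v[nc], prior[nc]))
```

For "She interrupted the chair", a verb whose objects skew toward people would give `P(person | interrupted)` well above its prior, so the person sense of "chair" wins over the furniture sense.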
Semantic similarity
• Uses
– classifying a new word
– expand queries in IR
• Are two words similar...
– When they are used together?
• IMF and Brazil
– When they are on the same topic?
• astronaut and spacewalking
– When they function interchangeably?
• Soviet and American
– When they are synonymous?
• astronaut and cosmonaut
Cosine is no panacea
• Corresponds to Euclidean distance between
points
• Should document-space vectors be treated
as points?
• Alternative: treat them as probability
distributions (after normalizing)
• Now, no reason to use cosine. Why not try
information-theoretic approach?
Alternative distance metrics to cosine
• Cosine of square roots (Goldszmidt)
• L1 norm -- Manhattan distance
– Sum of absolute value of difference of
components
• KL Distance
– D(p || q)
• Mutual information (why not?)
– I(X;Y) = D(p(x,y) || p(x)p(y))
• Information radius -- information lost by
describing both p and q by their midpoint
m = (p+q)/2.
– IRad(p,q) = D(p||m) + D(q||m)
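A sketch of these metrics for vectors already normalized into probability distributions (plain-list implementations, assuming equal-length inputs; KL terms with a zero numerator are dropped, as is conventional):

```python
from math import log2, sqrt

def l1(p, q):
    """Manhattan distance: sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    """KL divergence D(p || q) in bits; p_i = 0 terms contribute 0."""
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)

def irad(p, q):
    """Information radius: D(p||m) + D(q||m), m the midpoint of p and q."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return kl(p, m) + kl(q, m)

def cos_sim(p, q):
    """Cosine similarity, for comparison with the above."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q)))
```

Note that unlike KL divergence, the information radius is always finite (at most 2 bits): even for disjoint distributions, each point is still covered by the midpoint m.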