Lexical acquisition


I256
Applied Natural Language
Processing
Fall 2009
Lecture 8
• Words
– Lexical acquisition
– Collocations
– Similarity
– Selectional preferences
Barbara Rosario
Lexical acquisition
• Develop algorithms and statistical
techniques for filling the holes in existing
dictionaries and lexical resources by
looking at the occurrences of patterns of
words in large text corpora
– Collocations
– Semantic similarity
– Logical metonymy
– Selectional preferences
2
The limits of hand-encoded lexical
resources
• Manual construction of lexical resources is
very costly
• Because language keeps changing, these
resources have to be continuously
updated
• Quantitative information (e.g., frequencies,
counts) has to be computed automatically
anyway
3
The coverage problem
4
From CS 224N / Ling 280, Stanford, Manning
Lexical acquisition
• Examples:
– “insulin” and “progesterone” are in WordNet 2.1 but
“leptin” and “pregnenolone” are not.
– “HTML” and “SGML”, but not “XML” or “XHTML”.
– “Google” and “Yahoo”, but not “Microsoft” or “IBM”.
• We need some notion of word similarity to
know where to locate a new word in a
lexical resource
5
Lexical acquisition
• Lexical acquisition problems
– Collocations
– Semantic similarity
– Logical metonymy
– Selectional preferences
6
Collocations
• A collocation is an expression consisting of two or more
words that correspond to some conventional way of
saying things
– Noun phrases: weapons of mass destruction, stiff breeze (but
why not *stiff wind?)
– Verbal phrases: to make up
– Not necessarily contiguous: knock … door
• Limited compositionality
– Compositional if meaning of expression can be predicted by the
meaning of the parts
– Idioms are most extreme examples of non-compositionality
• Kick the bucket
– In collocations there is an element of meaning added to the
combination (i.e. the exact meaning cannot be derived directly
from its components)
• White hair, white wine, white woman
7
Collocations
• Non-substitutability
– Cannot substitute words in a collocation
• *yellow wine
• Non-modifiability
– To get a frog in one’s throat
• *To get an ugly frog in one’s throat
• Useful for
– Language generation
• *Powerful tea, *take a decision
– Machine translation
• Easy way to test if a combination is a collocation is to
translate it into another language
– Make a decision → *faire une décision (prendre), *fare una
decisione (prendere)
8
Subclasses of collocations
• Light verbs
– Make a decision, do a favor
• Phrasal verbs
– To tell off, make up
• Proper names
– San Francisco, New York
• Terminological expressions
– Hydraulic oil filter
• This is compositional, but we need to make sure, for example,
that it’s always translated the same way
9
Finding collocations
• Frequency
– If two words occur together a lot, that may be
evidence that they have a special function
– But if we sort by frequency of pairs C(w1, w2), then “of the” is
the most frequent pair
– Filter by POS patterns
– A N (linear function), N N (regression coefficients), etc.
• Mean and variance of the distance between the words
• For non-contiguous collocations
– She knocked at his door (d = 2)
– A man knocked on the metal front door (d = 4)
– Hypothesis testing (see page 162 Stat NLP)
• How do we know it’s really a collocation?
• Low mean distance can be accidental (new company)
• We need to know whether two words occur together by
chance or not (because they are a collocation)
– Hypothesis testing
10
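A minimal sketch (not from the original slides) of the t test for a candidate collocation, along the lines of Stat NLP page 162; the counts below are hypothetical, in the spirit of the “new companies” example:

```python
import math

def t_score(count_w1, count_w2, count_bigram, N):
    """t test for a candidate collocation (cf. Stat NLP, ch. 5).

    Null hypothesis: w1 and w2 are independent, so the expected
    probability of the bigram is p(w1) * p(w2).
    """
    observed = count_bigram / N           # sample mean of the Bernoulli trials
    expected = (count_w1 / N) * (count_w2 / N)
    variance = observed * (1 - observed)  # ~= observed for rare bigrams
    return (observed - expected) / math.sqrt(variance / N)

# Hypothetical counts: t comes out close to 1, well below the 2.576 needed
# to reject independence at alpha = 0.005, so this pair is not a collocation.
print(t_score(count_w1=15828, count_w2=4675, count_bigram=8, N=14_307_668))
```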
Finding collocations
• Mutual information measure
– A measure of how much a word tells us about the
other, i.e. the reduction in uncertainty of one word due
to knowing about another
• 0 when the two words are independent
I(x, y) = log [ p(x, y) / ( p(x) p(y) ) ]
• (see Stat NLP pages 66 and 178)
11
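A quick sketch of the pointwise mutual information measure above; the probabilities are invented purely for illustration:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information I(x, y) = log2( p(x, y) / (p(x) p(y)) )."""
    return math.log2(p_xy / (p_x * p_y))

# Independent words: p(x, y) = p(x) p(y), so I(x, y) = 0.
print(pmi(p_xy=0.0001 * 0.002, p_x=0.0001, p_y=0.002))   # 0.0
# Words that co-occur more often than chance get a positive score.
print(pmi(p_xy=0.00005, p_x=0.0001, p_y=0.002))           # ~ 7.97
```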
Lexical acquisition
• Lexical acquisition problems
– Collocations
– Semantic similarity
– Logical metonymy
– Selectional preferences
12
Lexical and semantic similarity
• Lexical and distributional notions of meaning similarity
• How can we work out how similar in meaning words are?
• What is it useful for?
– IR
– Generalization
• Semantically similar words behave similarly
– QA, inference…
• We could use anything in the thesaurus
– Meronymy
– Example sentences/definitions
– In practice, by “thesaurus-based” we usually just mean using the isa/subsumption/hypernym hierarchy
• Word similarity versus word relatedness
– Similar words are near-synonyms
– Related could be related any way
• Car, gasoline: related, not similar
• Doctor, nurse, fever: related (topic)
• Car, bicycle: similar
13
Semantic similarity
• Similar if contextually interchangeable
– The degree to which one word can be
substituted for another in a given context
• Suit similar to litigation (but only in the legal
context)
• Measures of similarity
– WordNet-based
– Vector-based
– Detecting hyponymy and other relations
14
WordNet: Semantic Similarity
• Whale is very specific (and baleen whale even more so),
while vertebrate is more general and entity is completely
general. We can quantify this concept of generality by
looking up the depth of each synset:
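A small illustration of this, assuming NLTK’s WordNet interface and these synset names (a sketch, not the original slide’s code):

```python
from nltk.corpus import wordnet as wn

# min_depth() counts hypernym links from the synset up to the root (entity).
for name in ['baleen_whale.n.01', 'whale.n.02', 'vertebrate.n.01', 'entity.n.01']:
    synset = wn.synset(name)
    print(synset.name(), synset.min_depth())
# More specific concepts sit deeper in the hierarchy; entity has depth 0.
```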
15
WordNet: Semantic Similarity
• Path_similarity: Two words are similar if nearby in thesaurus
hierarchy (i.e. short path between them)
– path_similarity assigns a score in the range 0–1 based on the shortest
path that connects the concepts in the hypernym hierarchy
• The numbers don’t mean much, but they decrease as we move
away from the semantic space of sea creatures to inanimate
objects.
16
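A hedged sketch of path_similarity with NLTK, comparing sea creatures with progressively less related concepts as described above (the exact synset names are assumptions):

```python
from nltk.corpus import wordnet as wn

right = wn.synset('right_whale.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
novel = wn.synset('novel.n.01')

# Scores shrink as we move from sea creatures toward unrelated concepts.
print(right.path_similarity(minke))     # close relatives: highest score
print(right.path_similarity(tortoise))  # both animals: lower
print(right.path_similarity(novel))     # unrelated: lowest
```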
WordNet: Path Similarity
17
From CS 224N / Ling 280, Stanford, Manning
WordNet: Path Similarity
• Problem with path similarity
– Assumes each link represents a uniform
distance
– Instead: want a metric which lets us represent the cost
of each edge independently
– There have been a whole slew of methods
that augment the thesaurus with notions from a
corpus (Resnik, Lin, …)
18
From CS 224N / Ling 280, Stanford, Manning
Vector-based lexical semantics
• Very old idea: the meaning of a word can be
specified in terms of the values of certain
`features’ (`COMPONENTIAL SEMANTICS’)
– dog : ANIMATE= +, EAT=MEAT, SOCIAL=+
– horse : ANIMATE= +, EAT=GRASS, SOCIAL=+
– cat : ANIMATE= +, EAT=MEAT, SOCIAL=-
• Similarity / relatedness: proximity in feature
space
19
From CS 224N / Ling 280, Stanford, Manning
Vector-based lexical semantics
20
From CS 224N / Ling 280, Stanford, Manning
General characterization of vector-based semantics
• Vectors as models of concepts
• The CLUSTERING approach to lexical
semantics:
1. Define properties one cares about, and give values to each
property (generally, numerical)
2. Create a vector of length n for each item to be classified
3. Viewing the n-dimensional vector as a point in n-space,
cluster points that are near one another
• What changes between models:
1. The properties used in the vector
2. The distance metric used to decide if two points are `close’
3. The algorithm used to cluster
21
From CS 224N / Ling 280, Stanford, Manning
Distributional Similarity: Using words as
features in a vector-based semantics
• The old decompositional semantic approach requires
– i. Specifying the features
– ii. Characterizing the value of these features for each lexeme
• Simpler approach: use as features the WORDS that
occur in the proximity of that word / lexical entry
– Intuition: “You shall know a word by the company it keeps.” (J. R.
Firth)
• More specifically, you can use as `values’ of these
features
– The FREQUENCIES with which these words occur near the
words whose meaning we are defining
– Or perhaps the PROBABILITIES that these words occur next to
each other
• Some psychological results support this view.
22
From CS 224N / Ling 280, Stanford, Manning
Using neighboring words to specify
the meaning of words
• Take, e.g., the following corpus:
– John ate a banana.
– John ate an apple.
– John drove a lorry.
• We can extract the following co-occurrence matrix:
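One plausible way to compute such a co-occurrence matrix in code (a sketch; here each whole sentence is treated as the context window):

```python
from collections import Counter, defaultdict

corpus = [
    "John ate a banana",
    "John ate an apple",
    "John drove a lorry",
]

# For each word, count the other words occurring in the same sentence.
cooccurrence = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.lower().split()
    for i, w in enumerate(tokens):
        for j, v in enumerate(tokens):
            if i != j:
                cooccurrence[w][v] += 1

print(cooccurrence["john"])    # e.g. Counter({'ate': 2, 'a': 2, ...})
print(cooccurrence["banana"])  # co-occurs with 'john', 'ate', 'a'
```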
23
Acquiring lexical vectors from a
corpus
• To construct vectors C(w) for each word w:
1. Scan a text
2. Whenever a word w is encountered, increment all cells of C(w)
corresponding to the words v that occur in the vicinity of w,
typically within a window of fixed size
• Differences among methods:
– Size of window
– Weighted or not
– Whether every word in the vocabulary counts as a dimension
(including function words such as the or and) or whether instead
only some specially chosen words are used (typically, the m
most common content words in the corpus; or perhaps modifiers
only).
– The words chosen as dimensions are often called CONTEXT
WORDS
– (Whether dimensionality reduction methods are applied)
24
From CS 224N / Ling 280, Stanford, Manning
Variant: using only modifiers to
specify the meaning of words
25
From CS 224N / Ling 280, Stanford, Manning
The CLUSTERING approach to
lexical semantics
– Create a vector of length n for each item to be
classified
• Viewing the n-dimensional vector as a point in n-space, cluster points that are near one another
– Define a similarity measure (the distance metric
used to decide if two points are `close’)
• For example:
– (Eventually) clustering algorithm
26
From CS 224N / Ling 280, Stanford, Manning
The HAL model
• Burgess and Lund (95, 98)
– A 160-million-word corpus of articles extracted from all
newsgroups containing English dialogue
– Context words: the 70,000 most frequently occurring
symbols within the corpus
– Window size: 10 words to the left and the right of the word
– Measure of similarity: cosine
– Frightened: scared, upset, shy, embarrassed, anxious,
worried, afraid
– Harmed: abused, forced, treated, discriminated, allowed,
attracted, taught
– Beatles: original, band, song, movie, album, songs, lyrics,
British
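A minimal sketch of the cosine similarity measure mentioned above, applied to toy context-count vectors (the vectors are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two count vectors (dicts: word -> count)."""
    shared = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in shared)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy context vectors: similar distributions give a cosine near 1.
print(cosine({'band': 3, 'song': 2, 'album': 1},
             {'band': 2, 'song': 3, 'lyrics': 1}))
```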
27
From CS 224N / Ling 280, Stanford, Manning
Latent Semantic Analysis
• Landauer et al. (97, 98)
– Goal: extract expected contextual usage from
passages
– Steps:
• Build a word / document co-occurrence matrix
• `Weight’ each cell (e.g., tf.idf)
• Perform a DIMENSIONALITY REDUCTION
– Argued to correlate well with humans on a
number of tests
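A toy sketch of the dimensionality-reduction step using a plain SVD from NumPy; a real LSA run would first apply a weighting such as tf.idf, which is omitted here:

```python
import numpy as np

# Toy word/document count matrix (rows = words, columns = documents).
counts = np.array([[2.0, 0.0, 1.0],
                   [1.0, 0.0, 2.0],
                   [0.0, 3.0, 0.0],
                   [0.0, 2.0, 1.0]])

# Dimensionality reduction: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
reduced_words = U[:, :k] * s[:k]   # k-dimensional word representations
print(reduced_words)
```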
28
From CS 224N / Ling 280, Stanford, Manning
Detecting Hyponymy and other
relations with patterns
• Goal: discover new hyponyms, and add
them to a taxonomy under the appropriate
hypernym
– Agar is a substance prepared from a mixture of
red algae, such as Gelidium, for laboratory or
industrial use.
– What does Gelidium mean? How do you know?
29
Hearst approach
• Hearst hand-built patterns:
30
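The original slide’s table of patterns is not reproduced here; below is a hedged sketch of one well-known Hearst pattern, “… such as NP”, applied to the agar sentence from the previous slide (a real system would match noun phrases rather than single words):

```python
import re

sentence = ("Agar is a substance prepared from a mixture of red algae, "
            "such as Gelidium, for laboratory or industrial use.")

# One classic Hearst pattern: "X ... such as Y" suggests Y is a kind of X.
pattern = re.compile(r"(\w+)\s*,?\s+such as\s+(\w+)")
for hypernym, hyponym in pattern.findall(sentence):
    print(f"{hyponym} is a kind of {hypernym}")
# -> Gelidium is a kind of algae
```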
From CS 224N / Ling 280, Stanford, Manning
Trained algorithm to discover patterns
• Snow, Jurafsky, Ng (05)
• Collect noun pairs from corpora
– (752,311 pairs from 6 million words of
newswire)
• Identify each pair as positive or negative
example of hypernym/hyponym relationship
– (14,387 yes, 737,924 no)
• Parse the sentences, extract patterns (lexical and parse paths)
• Train a hypernym classifier on these
patterns
31
From CS 224N / Ling 280, Stanford, Manning
32
From CS 224N / Ling 280, Stanford, Manning
Evaluation: precision and recall
• Precision can be seen as a measure of exactness or fidelity, whereas Recall is a measure of completeness.
• Used in information retrieval
– A perfect Precision score of 1.0 means that every result
retrieved by a search was relevant (but says nothing
about whether all relevant documents were retrieved)
whereas a perfect Recall score of 1.0 means that all
relevant documents were retrieved by the search (but
says nothing about how many irrelevant documents were
also retrieved).
– Precision is defined as the number of relevant documents
retrieved by a search divided by the total number of
documents retrieved
– Recall is defined as the number of relevant documents
retrieved by a search divided by the total number of
existing relevant documents (which should have been
retrieved).
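A small sketch of these two definitions in code (the document sets are hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a set of retrieved vs. relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = retrieved & relevant
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: 3 of the 4 retrieved documents are relevant,
# but 2 relevant documents were missed.
print(precision_recall(retrieved=[1, 2, 3, 4],
                       relevant=[2, 3, 4, 5, 6]))  # (0.75, 0.6)
```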
33
Evaluation: precision and recall
• Classification context
• A perfect Precision score of 1.0 for a class C means that every item labeled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labeled correctly)
• A perfect Recall of 1.0 means that every item from class C was labeled as belonging to class C (but says nothing about how many other items were incorrectly also labeled as belonging to class C).
34
Precision and recall: trade-off
• Often, there is an inverse relationship between Precision
and Recall, where it is possible to increase one at the
cost of reducing the other.
• For example, a search engine can increase its Recall
by retrieving more documents, at the cost of increasing
the number of irrelevant documents retrieved (decreasing
Precision).
• Similarly, a classification system for deciding whether or
not, say, a fruit is an orange, can achieve high Precision
by only classifying fruits with the exact right shape and
color as oranges, but at the cost of low Recall due to the
number of false negatives from oranges that did not
quite match the specification.
35
36
From CS 224N / Ling 280, Stanford, Manning
Lexical acquisition
• Lexical acquisition problems
– Collocations
– Semantic similarity
– Logical metonymy
– Selectional preferences
37
Other lexical semantics tasks
• Metonymy is a figure of speech in which a thing
or concept is not called by its own name, but by
the name of something intimately associated
with that thing or concept.
– Examples:
• Logical Metonymy
– enjoy the book means enjoy reading the book, and
easy problem means a problem that is easy to solve.
38
Other lexical semantics tasks
39
From CS 224N / Ling 280, Stanford, Manning
40
From CS 224N / Ling 280, Stanford, Manning
41
From CS 224N / Ling 280, Stanford, Manning
Lexical acquisition
• Lexical acquisition problems
– Collocations
– Semantic similarity
– Logical metonymy
– Selectional preferences
42
Selectional preferences
• Most verbs prefer arguments of a
particular type: selectional preferences or
restrictions
– Objects of eat tend to be food, subjects of
think tend to be people, etc.
– “Preferences” to allow for metaphors
• Fear eats the soul
• Why is it important for NLP?
43
Selectional preferences
• Why Important?
– To infer meaning from selectional restrictions
• Suppose we don’t know the word durian (not in
the vocabulary)
• Susan ate a very fresh durian
• Infer that durian is a type of food
– Ranking the possible parses of a sentence
• Give higher scores to parses where the verb has
‘natural’ arguments
44
Model of selectional preferences
• Resnik, 93 (see page 288 Stat NLP)
• Two main concepts
1. Selectional preference strength
– How strongly the verb constrains its direct object
• Eat, find, see
2. Selectional association between the verb and the object semantic class
• Eat and food
• The higher 1 and 2, the less important it is to have an explicit object (i.e. the more likely the implicit object construction is)
• Bo ate, but *Bo saw
45
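A minimal sketch of Resnik’s two quantities as defined in Stat NLP (page 288), assuming we already have a prior P(c) over noun classes and a conditional P(c | v) for each verb; the distributions below are invented for illustration:

```python
import math

def preference_strength(p_class, p_class_given_verb):
    """S(v): KL divergence between P(C | v) and the prior P(C)."""
    return sum(p_v_c * math.log2(p_v_c / p_class[c])
               for c, p_v_c in p_class_given_verb.items() if p_v_c > 0)

def selectional_association(c, p_class, p_class_given_verb):
    """A(v, c): class c's share of the verb's preference strength."""
    strength = preference_strength(p_class, p_class_given_verb)
    return (p_class_given_verb[c] *
            math.log2(p_class_given_verb[c] / p_class[c])) / strength

# Hypothetical class distributions: "eat" strongly prefers food objects,
# while "see" is far less choosy, so its preference strength is lower.
p_class     = {'food': 0.1, 'people': 0.3, 'artifact': 0.6}
p_given_eat = {'food': 0.9, 'people': 0.05, 'artifact': 0.05}
p_given_see = {'food': 0.1, 'people': 0.35, 'artifact': 0.55}

print(preference_strength(p_class, p_given_eat))   # high: eat constrains its object
print(preference_strength(p_class, p_given_see))   # near zero: see does not
print(selectional_association('food', p_class, p_given_eat))
```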
Next class
• Next time: review
• Classification
• Project ideas (likely on October 6)
• Two more assignments (most likely)
• Project proposals (1-2 pages description)
• Projects
46