I256 Applied Natural Language Processing Fall 2009


Natural Language Processing: Lexical Acquisition
Lecture 8
Pusan National University
Minho Kim
([email protected])
Contents
• Introduction
• Evaluation Measures
• Verb Subcategorization
• Attachment Ambiguity
• Selectional Preferences
• Semantic Similarity
2
Lexical acquisition
• Develop algorithms and statistical
techniques for filling the holes in existing
dictionaries and lexical resources by
looking at the occurrences of patterns of
words in large text corpora
– Collocations
– Selectional preferences
– Subcategorization
– Semantic categorization
3
General dictionary - 표준국어대사전 (Standard Korean Language Dictionary)
Headword
Pronunciation information
Conjugation information
Definition
Example sentences
4
Electronic dictionary - 세종전자사전 (Sejong Electronic Dictionary)
5
The limits of hand-encoded lexical
resources
• Manual construction of lexical resources is
very costly
• Because language keeps changing, these
resources have to be continuously
updated
• Quantitative information (e.g., frequencies,
counts) has to be computed automatically
anyway
6
Lexical acquisition
• Examples:
– “insulin” and “progesterone” are in WordNet 2.1 but
“leptin” and “pregnenolone” are not.
– “HTML” and “SGML”, but not “XML” or “XHTML”.
• We need some notion of word similarity to
know where to locate a new word in a
lexical resource
7
Evaluation Measures
Performance of a system
• Information retrieval engine
  – A user wants to find documents related to "NLP"
  – There are in fact many documents related to "NLP"
  – The search returns a single document
  – That document is indeed related to "NLP"
  – Is this a good retrieval system?
• Speller
  – A user types five words
  – Three of the five words are misspelled
  – The system judges all five words to be errors
  – Is this a good spelling checker?
• Word sense disambiguation system
  – The system always decides that "사과" (which can mean either "apple" or "apology") has the sense "apple"
  – Is this a good system?
9
Evaluation: precision and recall
[Diagram: the set of selected items overlapping the set of target items; the overlap is tp, selected-only is fp, target-only is fn, and everything outside both sets is tn]
• For many problems, we have a set of targets
– Information retrieval: relevant documents
– Spelling error correction: error words
– Word sense disambiguation: sense of ambiguous word
• Precision can be seen as a measure of exactness or fidelity,
whereas Recall is a measure of completeness.
10
Evaluation: precision and recall
• Used in information retrieval
– A perfect Precision score of 1.0 means that
every result retrieved by a search was relevant
(but says nothing about whether all relevant
documents were retrieved)
– A perfect Recall score of 1.0 means that all relevant
documents were retrieved by the search (but says
nothing about how many irrelevant documents were
also retrieved).
– Precision is defined as the number of relevant
documents retrieved by a search divided by the
total number of documents retrieved
– Recall is defined as the number of relevant
documents retrieved by a search divided by the
total number of existing relevant documents
(which should have been retrieved).
11
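These definitions are easy to state as code. A minimal Python sketch, with the selected and target items represented as sets; the example numbers mirror the speller scenario above:

    def precision_recall(selected, target):
        """Compute precision and recall from a selected set and a target set."""
        tp = len(selected & target)   # true positives: selected and in the target
        fp = len(selected - target)   # false positives: selected but not in the target
        fn = len(target - selected)   # false negatives: in the target but missed
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return precision, recall

    # The speller example above: 5 words flagged, only 3 actually misspelled
    selected = {"w1", "w2", "w3", "w4", "w5"}
    target = {"w1", "w2", "w3"}
    print(precision_recall(selected, target))  # (0.6, 1.0): perfect recall, poor precision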
Evaluation: precision and recall
• Joint distribution of the variables:

                        Actual: target        Actual: ¬target
  System: selected      tp (true positives)   fp (false positives)
  System: ¬selected     fn (false negatives)  tn (true negatives)

• False positives = Type I errors
• False negatives = Type II errors
12
Computing Recall/Precision
Rank  Doc #  Relevant?
  1   588    x    R = 1/6 = 0.167; P = 1/1 = 1.0
  2   589    x    R = 2/6 = 0.333; P = 2/2 = 1.0
  3   576
  4   590    x    R = 3/6 = 0.5;   P = 3/4 = 0.75
  5   986
  6   592    x    R = 4/6 = 0.667; P = 4/6 = 0.667
  7   984
  8   988
  9   578
 10   985
 11   103
 12   591
 13   772    x    R = 5/6 = 0.833; P = 5/13 = 0.385
 14   990

Let the total number of relevant docs be 6, and check precision at each new recall point.
One relevant document is missing from the ranking, so the system never reaches 100% recall.
13
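The recall/precision points above can be reproduced with a short script. One caveat: the id of the sixth, never-retrieved relevant document is not given on the slide, so the 991 below is a made-up placeholder:

    ranking = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
    relevant = {588, 589, 590, 592, 772, 991}  # 6 relevant docs; 991 is hypothetical

    found = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:   # a new recall point is reached
            found += 1
            print(f"R={found}/{len(relevant)}={found/len(relevant):.3f}; "
                  f"P={found}/{rank}={found/rank:.3f}")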
Precision and recall: trade-off
[Plot: precision (y-axis, 0 to 1) against recall (x-axis, 0 to 1). The ideal system sits at the top-right corner. A high-precision, low-recall system returns relevant documents but misses many useful ones; a high-recall, low-precision system returns most relevant documents but includes lots of junk.]
14
F measure
• Combines precision and recall into a single measure of overall performance:

     F = 1 / ( α (1/P) + (1 − α) (1/R) )

  P : precision
  R : recall
  α : a factor that determines the relative weighting of P and R
  α = 0.5 is often chosen for equal weighting, in which case F reduces to the harmonic mean 2PR / (P + R)
15
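A small sketch of this measure in Python; the precision/recall values plugged in below are just the speller example from earlier:

    def f_measure(p, r, alpha=0.5):
        """Weighted harmonic mean of precision and recall."""
        if p == 0 or r == 0:
            return 0.0
        return 1.0 / (alpha / p + (1 - alpha) / r)

    print(f_measure(0.6, 1.0))        # 0.75: the balanced F for the speller example
    print(f_measure(0.6, 1.0, 0.75))  # 0.667: weighting precision more heavily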
Verb Subcategorization
Finding collocations
• Frequency
  – If two words occur together a lot, that may be evidence that they have a special function
  – But if we simply sort word pairs by frequency C(w1, w2), then "of the" is the most frequent pair
  – Filter by POS patterns: A N (linear function), N N (regression coefficients), etc.
• Mean and variance of the distance between the words (a small code sketch follows below)
  – For non-contiguous collocations:
    • She knocked at his door (d = 2)
    • A man knocked on the metal front door (d = 4)
• Hypothesis testing (see page 162, Stat NLP)
  – How do we know it is really a collocation?
  – A low mean distance can be accidental (new company)
  – We need to know whether two words occur together by chance or because they are a collocation
17
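A minimal sketch of the mean/variance idea, assuming we have already collected the signed offsets between the two words from a corpus; the offsets below are invented:

    from statistics import mean, pstdev

    # Invented signed offsets of "door" relative to "knocked" across sentences,
    # e.g. "She knocked at his door" -> d = 2
    offsets = [2, 4, 3, 3, 2]

    print(mean(offsets))    # low mean distance: the words tend to occur close together
    print(pstdev(offsets))  # low variance: the distance is systematic, not accidental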
Finding collocations
• Mutual information measure
  – A measure of how much one word tells us about the other, i.e. the reduction in uncertainty about one word due to knowing the other
  – Equals 0 when the two words are independent

     I(x, y) = log[ p(x, y) / ( p(x) p(y) ) ]

  (see Stat NLP pages 66 and 178)
18
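A sketch of the (pointwise) mutual information computation from raw counts; the counts below are illustrative, not tied to a specific corpus:

    import math

    def pmi(c_xy, c_x, c_y, n):
        """I(x, y) = log2[ p(x, y) / (p(x) p(y)) ]; 0 if x and y are independent."""
        return math.log2((c_xy / n) / ((c_x / n) * (c_y / n)))

    # Illustrative counts: a pair that almost always co-occurs
    print(pmi(c_xy=20, c_x=42, c_y=20, n=14_307_668))  # ~18.4 bits: strong association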
Main verbs
• Transitive
– requires a direct object (found with questions: what? or whom?)
• ?The child broke.
• The child broke a glass.
• Intransitive
– does not require a direct object.
• The train arrived.
• Some verbs can be both transitive and intransitive
– The ship sailed the seas. (transitive)
– The ship sails at noon. (intransitive)
– I met my friend at the airport. (transitive)
– The delegates met yesterday. (intransitive)
19
Verb Phrases
• VP --> head-verb complements adjuncts
• Some VPs:

  Verb          eat.
  Verb NP       leave Montreal.
  Verb NP PP    leave Montreal in the morning.
  Verb PP       leave in the morning.
  Verb S        think I would like the fish.
  Verb VP       want to leave.
                want to leave Montreal.
                want to leave Montreal in the morning.
                want to want to leave Montreal in the morning.
20
Verb Subcategorization
• Verb subcategorization: verbs subcategorize for different syntactic categories.
• Verb subcategorization frame: a particular set of syntactic categories that a verb can appear with is called a subcategorization frame.
21
Subcategorization frames
• Some verbs can take complements that others cannot
I want to fly.
* I find to fly.
• Verbs are subcategorized according to the complements
they can take --> subcategorization frames
– traditionally: transitive vs intransitive
– nowadays: up to 100 subcategories / frames
Frame         Verb          Example
empty         eat, sleep    I eat.
NP            prefer, find  I prefer apples.
NP NP         show, give    Show me your hand.
PPfrom PPto   fly, travel   I fly from Montreal to Toronto.
VPto          prefer, want  I prefer to leave.
S             mean          Does this mean you are going to leave me?
22
Subcategorization frames: parsing
• Subcategorization is important for parsing:
  a) She told the man where Peter grew up.
  b) She found the place where Peter grew up.
  – tell → NP NP (subject, object): the where-clause can be a complement of "tell"
  – find → NP (subject): the where-clause must be a relative clause modifying "the place"
23
Brent’s Lerner system
• The algorithm has two steps:
① Cues: define a regular pattern of words and syntactic categories which indicates the presence of the frame with high certainty.
   ① Certainty is formalized as a probability of error.
   ② For each cue c_j we define an error rate ε_j.
② Hypothesis testing is done by contradiction:
   ① We assume that the frame is not appropriate for the verb, and call this H0 (the null hypothesis).
   ② We reject the hypothesis if c_j indicates with high probability that H0 is wrong.
24
Cues
Cue for frame "NP NP":

  (OBJ | SUBJ_OBJ | CAP) (PUNC | CC)

• OBJ = personal pronouns that must be objects (me, him)
• SUBJ_OBJ = personal pronouns that can be subject or object (you, it)
• CAP = capitalized word
• PUNC = punctuation mark
• CC = subordinating conjunction (if, before)

Example: […] greet-V Peter-CAP ,-PUNC […]
25
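One possible reading of this cue as code — a simplified sketch only: the pronoun and conjunction lists below are partial and assumed, and tokenization/tagging is glossed over:

    OBJ = {"me", "him", "us", "them"}        # pronouns that must be objects
    SUBJ_OBJ = {"you", "it"}                 # pronouns that can be subject or object
    CC = {"if", "before", "after", "when"}   # subordinating conjunctions (partial list)
    PUNC = {",", ".", ";", ":", "!", "?"}

    def cue_fires(tok1, tok2):
        """(OBJ | SUBJ_OBJ | CAP) followed by (PUNC | CC)."""
        first = tok1.lower() in OBJ or tok1.lower() in SUBJ_OBJ or tok1[0].isupper()
        second = tok2 in PUNC or tok2.lower() in CC
        return first and second

    print(cue_fires("Peter", ","))  # True, as in "[...] greet-V Peter-CAP ,-PUNC [...]"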
Hypothesis testing

  p_E = P( v_i(f_j) = 0 | C(v_i, c_j) ≥ m )
      = Σ_{r=m}^{n} (n choose r) ε_j^r (1 − ε_j)^{n−r}

  n : number of times v_i occurs in the corpus
  m : number of times v_i occurs with cue c_j, i.e. C(v_i, c_j)
  v_i(f_j) = 0 : verb v_i does not permit frame f_j
  ε_j : error rate for cue c_j

• If p_E < α, then we reject H0 (see the sketch below)
• Precision: close to 100% (when α = 0.02)
• Recall: 47–100%
26
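The error probability p_E is just a binomial tail, so the test fits in a few lines of Python; the counts and error rate below are hypothetical:

    from math import comb

    def p_error(n, m, eps):
        """Tail probability that an unreliable cue (error rate eps) fires
        at least m times in n occurrences of the verb by chance alone."""
        return sum(comb(n, r) * eps**r * (1 - eps)**(n - r) for r in range(m, n + 1))

    # Hypothetical: verb seen n=200 times, cue fired m=11 times, cue error rate 2%
    pE = p_error(200, 11, 0.02)
    print(pE, pE < 0.02)  # small pE -> reject H0, i.e. accept the frame for the verb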
Attachment Ambiguity
Attachment ambiguity
 (8.14) The children ate the cake with a spoon.
 I saw the man with a telescope
• (Log) Likelihood Ratio [a common and good way of comparing
between two exclusive alternatives]
P( p | v)
P ( p | n)
Problem: ignores preference for attaching phrase “low” in parse
tree
 (v, n, p)  log
•
28
Simple model
• Chrysler confirmed that it would end its troubled venture with
Maserati.
29
Hindle and Rooth(1)
• Event space: all V NP PP sequences
• How likely is a preposition to attach to the verb vs. the noun?
  – VA_p : is there a PP headed by p which attaches to v?
  – NA_p : is there a PP headed by p which attaches to n?
  – Both can be 1:
    • He put the book on World War II on the table.
    • She sent him into the nursery to gather up his toys.

     λ(v, n, p) = log2[ P(Attach(p) = v | v, n) / P(Attach(p) = n | v, n) ]
                = log2[ P(VA_p = 1 | v) P(NA_p = 0 | n) / P(NA_p = 1 | n) ]

• Maximum-likelihood estimates:

     P(VA_p = 1 | v) = C(v, p) / C(v)
     P(NA_p = 1 | n) = C(n, p) / C(n)
30
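A minimal sketch of the λ computation from the four counts; the counts below are invented for illustration, and the sketch assumes C(n, p) > 0:

    import math

    def pp_lambda(c_vp, c_v, c_np, c_n):
        """Hindle & Rooth's log-likelihood ratio; positive -> verb attachment.
        c_vp = C(v, p), c_v = C(v), c_np = C(n, p), c_n = C(n)."""
        p_va = c_vp / c_v   # P(VA_p = 1 | v)
        p_na = c_np / c_n   # P(NA_p = 1 | n)
        return math.log2(p_va * (1 - p_na) / p_na)

    # Invented counts for verb "send", noun "soldiers", preposition "into"
    score = pp_lambda(c_vp=86, c_v=1742, c_np=2, c_n=1478)
    print(score, "verb attachment" if score > 0 else "noun attachment")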
Hindle and Rooth(2)
31
General remarks on PP attachment
Model's limitations:
• Considers only the identity of the preposition, the noun, and the verb
• Considers only the most basic case: a PP immediately after an NP object, modifying either the immediately preceding n or v
  – The board approved [its acquisition] [by Royal Trustco Ltd.] [of Toronto] [for $27 a share] [at its monthly meeting].
Other attachment issues:
• Attachment ambiguity in noun compounds
  (a) [[Door bell] manufacturer] : left-branching
  (b) [Woman [aid worker]] : right-branching
32
Selectional Preferences
Selectional Preferences(1)
• Selectional preference (or selectional restriction)
  – Most verbs prefer arguments of a particular type
  – These are preferences, not hard rules: "eat" normally takes a food argument, yet a non-food argument is possible
    Example) eating one's words
34
Selectional Preferences(2)
• Acquiring selectional preferences is important in Statistical NLP for a number of reasons:
  – If "durian" is missing from the dictionary, we can infer part of its meaning from selectional restrictions
  – Another important use is ranking the parses of a sentence: give high scores to parses in which the verb has natural arguments
35
Selectional Preferences(3)
• Resnik's model (1993, 1996)
1. Selectional preference strength
   – How strongly the verb constrains its direct object
   – Two assumptions:
     ① Take only the head noun
     ② Generalize over classes of nouns

     S(v) = D( P(C | v) || P(C) ) = Σ_c P(c | v) log[ P(c | v) / P(c) ]

  P(C) : overall probability distribution of noun classes
  P(C | v) : probability distribution of noun classes in the direct object position of v
36
Selectional Preferences(4)
• Table 8.5 Selectional preference strength (reproduced in code below)

  Noun class c   P(c)   P(c | eat)   P(c | see)   P(c | find)
  people         0.25   0.01         0.25         0.33
  furniture      0.25   0.01         0.25         0.33
  food           0.25   0.97         0.25         0.33
  action         0.25   0.01         0.25         0.01
  SPS S(v)              1.76         0.00         0.35
37
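The SPS row of Table 8.5 can be reproduced directly, assuming base-2 logarithms:

    import math

    p_c = {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25}
    p_c_given_v = {
        "eat":  {"people": 0.01, "furniture": 0.01, "food": 0.97, "action": 0.01},
        "see":  {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25},
        "find": {"people": 0.33, "furniture": 0.33, "food": 0.33, "action": 0.01},
    }

    def sps(v):
        """S(v) = KL divergence between P(C | v) and P(C)."""
        return sum(p * math.log2(p / p_c[c]) for c, p in p_c_given_v[v].items())

    for v in ("eat", "see", "find"):
        print(v, round(sps(v), 2))   # eat 1.76, see 0.0, find 0.35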
Selectional Preferences(5)
• The notions of Resnik's model (cont'd)
2. Selectional association A(v, c) between a verb v and a class c:

     A(v, c) = P(c | v) log[ P(c | v) / P(c) ] / S(v)

• A rule for assigning an association strength to nouns:

     A(v, n) = max_{c ∈ classes(n)} A(v, c)

  Ex) (8.31) Susan interrupted the chair.

     A(interrupt, chair) = max_{c ∈ classes(chair)} A(interrupt, c)
                         = A(interrupt, people)
38
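Building on the previous sketch, selectional association is a class's normalized contribution to S(v). Since "interrupt" is not in Table 8.5, the toy example below uses "eat" with an ambiguous noun like "chair" instead:

    def assoc(v, c):
        """A(v, c): the class's contribution to S(v), normalized by S(v)."""
        p = p_c_given_v[v][c]
        return p * math.log2(p / p_c[c]) / sps(v)

    def assoc_noun(v, classes_of_noun):
        """A(v, n) = max over the classes the noun belongs to."""
        return max(assoc(v, c) for c in classes_of_noun)

    # "chair" belongs to both the people sense and the furniture sense;
    # both are equally implausible objects of "eat" (negative association)
    print(assoc_noun("eat", ["people", "furniture"]))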
Selectional Preferences(6)
• Estimating the probability P(c | v):

     P(c | v) = P(v, c) / P(v)

     P(v) = C(v) / Σ_{v'} C(v')

     P(v, c) = (1/N) Σ_{n ∈ words(c)} C(v, n) / |classes(n)|

  N : total number of verb-object pairs in the corpus
  words(c) : set of all nouns in class c
  |classes(n)| : number of noun classes that contain n as a member
  C(v, n) : number of verb-object pairs with v as the verb and n as the head of the object NP
39
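A toy sketch of these estimates; the verb-object pairs and the noun-class inventory below are invented:

    from collections import Counter

    pairs = [("eat", "cake"), ("eat", "apple"), ("see", "movie"), ("eat", "words")]
    classes = {"cake": {"food"}, "apple": {"food"},
               "movie": {"action"}, "words": {"communication"}}
    words_in = {"food": {"cake", "apple"}, "action": {"movie"},
                "communication": {"words"}}

    N = len(pairs)                       # total number of verb-object pairs
    C_vn = Counter(pairs)                # C(v, n)
    C_v = Counter(v for v, _ in pairs)   # C(v)

    def prob_vc(v, c):
        return sum(C_vn[(v, n)] / len(classes[n]) for n in words_in[c]) / N

    def cond_prob(v, c):
        """P(c | v) = P(v, c) / P(v), with P(v) = C(v) / N here."""
        return prob_vc(v, c) / (C_v[v] / N)

    print(cond_prob("eat", "food"))  # 2/3: two of eat's three objects are food nouns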
Selectional Preferences(7)
• Resnik's experiments on the Brown corpus (1996): Table 8.6
  – Left half: typical objects
  – Right half: atypical objects
  – For most verbs, association strength predicts which object is typical
  – Most errors the model makes stem from the fact that it performs a form of disambiguation, choosing the highest A(v, c) as A(v, n)
• Implicit object alternation
  a. Mike ate the cake.
  b. Mike ate.
  – The more constraints a verb puts on its object, the more likely it is to permit the implicit-object construction
  – Selectional preference strength (SPS) is seen as the more basic phenomenon, explaining both the occurrence of implicit objects and association strength
40
Selectional Preferences(8)
41
Semantic Similarity
Semantic Similarity (1)
• Lexical acquisition as the acquisition of meaning
• Semantic similarity
  – Automatically acquiring a relative measure of how similar a new word is to known words is much easier than determining what the meaning actually is
  – Most often used for generalization, under the assumption that semantically similar words behave similarly
    ex) Susan had never eaten a fresh durian before.
• Similarity-based generalization vs. class-based generalization
  – Similarity-based generalization: consider the closest neighbors
  – Class-based generalization: consider the whole class
• Uses of semantic similarity
  – Query expansion: astronaut → cosmonaut
  – k nearest neighbors classification
43
Semantic Similarity(2)
• A notion of semantic similarity
  – An extension of synonymy, referring to cases of near-synonymy like the pair dwelling/abode
  – Two words from the same domain or topic
    ex) doctor, nurse, fever, intravenous
  – Judgments of semantic similarity can be explained by the degree of contextual interchangeability (Miller and Charles, 1991)
  – Ambiguity presents a problem for all notions of semantic similarity
    • When applied to ambiguous words, "semantically similar" usually means 'similar to the appropriate sense'
      ex) litigation ≒ suit (≠ clothes)
• Similarity measures
  – Vector space measures
  – Probabilistic measures
44
Semantic Similarity(3)
45
Vector space measures(1)
• The two words whose semantic similarity we want to compute are represented as vectors in a multi-dimensional space.
  1. A document-by-word matrix A (Figure 8.3)
     – Entry a_ij contains the number of times word j occurs in document i
  2. A word-by-word matrix B (Figure 8.4)
     – Entry b_ij contains the number of times word j co-occurs with word i
  3. A modifier-by-head matrix C (Figure 8.5)
     – Entry c_ij contains the number of times that head j is modified by modifier i
• Different spaces capture different types of semantic similarity
  – Document-word and word-word spaces capture topical similarity
  – The modifier-head space captures more fine-grained similarity
46
Vector space measures(2)
• Similarity measures for binary vectors (Table 8.7)
  – Matching coefficient: simply counts the number of dimensions on which both vectors are non-zero:

      |X ∩ Y|

  – Dice coefficient: normalizes for the length of the vectors and the total number of non-zero entries:

      2 |X ∩ Y| / ( |X| + |Y| )

  – Jaccard (or Tanimoto) coefficient: penalizes a small number of shared entries more than the Dice coefficient does:

      |X ∩ Y| / |X ∪ Y|
47
Vector space measures(3)
• Similarity measures for binary vectors (cont'd)
  – Overlap coefficient: has a value of 1.0 if every dimension with a non-zero value for the first vector is also non-zero for the second vector:

      |X ∩ Y| / min(|X|, |Y|)

  – Cosine: penalizes less in cases where the numbers of non-zero entries are very different:

      |X ∩ Y| / sqrt( |X| · |Y| )

• Real-valued vector spaces
  – A more powerful representation for linguistic objects
  – The length of a vector: |x| = sqrt( Σ_{i=1}^{n} x_i² )
48
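All five binary-vector measures fit in a few lines, with vectors represented as sets of the dimensions that are "on"; the context sets below are invented:

    import math

    def similarities(X, Y):
        """The binary-vector measures of Table 8.7; X and Y are sets of
        the dimensions on which each vector is non-zero."""
        inter = len(X & Y)
        return {
            "matching": inter,
            "dice":     2 * inter / (len(X) + len(Y)),
            "jaccard":  inter / len(X | Y),
            "overlap":  inter / min(len(X), len(Y)),
            "cosine":   inter / math.sqrt(len(X) * len(Y)),
        }

    # Invented context sets for two words
    print(similarities({"Soviet", "spacewalking"}, {"American", "spacewalking"}))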
Vector space measures(4)
• Real-valued vector space (cont'd)
  – The dot product between two vectors:

      x · y = Σ_{i=1}^{n} x_i y_i

  – The cosine measure:

      cos(x, y) = (x · y) / ( |x| |y| )
                = Σ_{i=1}^{n} x_i y_i / ( sqrt(Σ_{i=1}^{n} x_i²) · sqrt(Σ_{i=1}^{n} y_i²) )

  – The Euclidean distance:

      |x − y| = sqrt( Σ_{i=1}^{n} (x_i − y_i)² )

• Advantages of vector spaces as a representational medium:
  – Simplicity
  – Computational efficiency
• Disadvantages of vector spaces:
  – Most measures operate only on binary data (cosine is the exception)
  – Cosine has its own problem:
    • Cosine assumes a Euclidean space
    • A Euclidean space is not a well-motivated choice if the vectors we are dealing with are vectors of probabilities or counts
49
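A minimal sketch of the real-valued measures; the count vectors below are invented:

    import math

    def cosine(x, y):
        dot = sum(xi * yi for xi, yi in zip(x, y))
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        norm_y = math.sqrt(sum(yi * yi for yi in y))
        return dot / (norm_x * norm_y)

    def euclidean(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    # Invented count vectors for two words over three contexts
    x, y = [2, 0, 2], [0, 3, 3]
    print(cosine(x, y), euclidean(x, y))  # 0.5, ~3.74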
Probabilistic measures(1)
• Transform semantic similarity into the similarity of two probability distributions
• Transform the matrices of counts in Figures 8.3, 8.4, and 8.5 into matrices of conditional probabilities
  Ex) (American, astronaut) → P(American | astronaut) = 1/2 = 0.5
• Measures of (dis-)similarity between probability distributions (Table 8.9)
  – Three measures of dissimilarity investigated by Dagan et al. (1997):
  1. KL divergence

       D(p || q) = Σ_i p_i log( p_i / q_i )

     – Measures how much information is lost if we assume distribution q when the true distribution is p
     – Two problems for practical applications:
       » The value is infinite when q_i = 0 and p_i ≠ 0
       » Asymmetric: D(p || q) ≠ D(q || p)
50
Probabilistic measures(2)
• Measures of similarity between probability distributions (cont'd)
  2. Information radius (IRad)

       IRad(p, q) = D( p || (p + q)/2 ) + D( q || (p + q)/2 )

     – Symmetric, and no problem with infinite values
     – Measures how much information is lost if we describe the two words that correspond to p and q with their average distribution
  3. L1 norm

       L1(p, q) = (1/2) Σ_i | p_i − q_i |

     – A measure of the expected proportion of events that are going to be different between the distributions p and q
51
Probabilistic measures(3)
• L1 norm example (from Figure 8.5):

  p1 = P(Soviet | cosmonaut) = 0.5
  p2 = P(American | cosmonaut) = 0
  p3 = P(spacewalking | cosmonaut) = 0.5
  q1 = P(Soviet | astronaut) = 0
  q2 = P(American | astronaut) = 0.5
  q3 = P(spacewalking | astronaut) = 0.5

  (1/2) Σ_i | p_i − q_i | = (1/2)(0.5 + 0.5 + 0) = 0.5
52
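The three measures side by side, verifying the worked example above. Note that plain KL divergence would blow up on these two distributions, since each has a zero where the other does not:

    import math

    def kl(p, q):
        """D(p || q); infinite (here: raises) if some q_i = 0 while p_i > 0."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    def irad(p, q):
        m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
        return kl(p, m) + kl(q, m)

    def l1(p, q):
        return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

    p = [0.5, 0.0, 0.5]  # P(. | cosmonaut) over (Soviet, American, spacewalking)
    q = [0.0, 0.5, 0.5]  # P(. | astronaut)
    print(l1(p, q))      # 0.5, matching the worked example above
    print(irad(p, q))    # 1.0: finite and symmetric, unlike KL here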
The Role of Lexical Acquisition in Statistical NLP(1)
• Lexical acquisition plays a key role in Statistical NLP because available lexical resources are always lacking in some way:
  – The cost of building lexical resources manually
  – The quantitative part of lexical acquisition almost always has to be done automatically
  – Many lexical resources were designed for human consumption
• The best solution: augmentation of a manual resource by automatic means
  – The main reason: the inherent productivity of language
53
What does the future hold for lexical acquisition?
• Look harder for sources of prior knowledge that can constrain the process of lexical acquisition.
  – Much of the hard work of lexical acquisition will be in building interfaces that admit easy specification of prior knowledge and easy correction of mistakes made in automatic learning.
  – Linguistic theory, an important source of prior knowledge, has been surprisingly underutilized in Statistical NLP.
• Dictionaries are only one source of information, besides text corpora, that can be important for lexical acquisition.
  – Other sources: encyclopedias, thesauri, gazetteers, collections of technical vocabulary, etc.
• If we succeed in emulating human acquisition of language by tapping into these rich sources of information, then a breakthrough in the effectiveness of lexical acquisition can be expected.
54