Prepositional-phrase attachment
disambiguation using derived
semantic information and large
external corpora
thesis defense, Lena Dankin
Prepositional Phrase Attachment
blackbird fly into the light of the dark black night
Prepositional Phrase Attachment
• Prepositions are very frequent in text.
– out of the top-ten most frequent words in English,
four are prepositions (of, to, in, for).
• PP attachment is part of the syntactic parsing
process, as it directly affects the resulting parse
tree.
• An incorrect attachment can have a major
impact on linguistic tasks that embed
syntactic parsing, such as information
retrieval
Prepositional Phrase Attachment
Enraged Cow Injures Farmer With Axe
• Who uses the axe?
PP Attachment Disambiguation
• Not a purely syntactic problem
PP Attachment Disambiguation
• Sometimes semantics may not be enough
Common Approaches
• Binary problem – attach the PP to the noun or
to the verb
• Semantic expansions of nouns/verbs
(olives/anchovies)
• Distributional approach
Benchmark
• RRR dataset (Ratnaparkhi, Reynar and Roukos, 1994)
• Contains 27,937 quadruplets of the form:
<v, n1, p, n2>
• No sentences are available for the tuples
Benchmark
Results on RRR dataset
Algorithm: Accuracy (%)
ME (Ratnaparkhi, Reynar and Roukos, 1994): 85.3
Backoff model (Collins and Brooks, 1995): 85.3
Nearest neighbor (Zhao and Lin, 2004): 86.5
Our Approach
[Pipeline diagram: sentence → syntactic parser → <v, n1, p, n2> → generate queries → search in a large corpus → counts / matching sentences → sentence analyzer + context analyzer → classifier]
Data sets – WSJ-sent
• In order to obtain a sentence for each
quadruplet, as well as its context, we generated
a new data set from the WSJ.
• Extraction algorithm:
– Use the gold-standard parse trees for the WSJ
– For each sentence, detect all prepositions
– The correct attachment is the head noun/verb of
the PP's parent node
– Detect the second (competing) candidate
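To make the extraction step concrete, here is a rough Python sketch (not the thesis code) of walking a gold parse tree and reading off each PP, its preposition, and whether its parent is a VP or an NP; the nltk Tree representation and the crude choice of the last leaf as the PP-object head are illustrative assumptions.

```python
# Illustrative sketch of extracting PP attachments from gold-standard parse trees.
from nltk.tree import Tree

def pp_attachments(tree: Tree):
    """Yield (parent_label, preposition, pp_object_head) for every PP node."""
    for pos in tree.treepositions():
        node = tree[pos]
        if isinstance(node, Tree) and node.label() == "PP":
            parent = tree[pos[:-1]] if pos else None
            prep = node.leaves()[0]        # the preposition itself
            n2 = node.leaves()[-1]         # crude stand-in for the head of the PP object
            parent_label = parent.label() if isinstance(parent, Tree) else None
            yield parent_label, prep, n2   # VP parent -> verb attachment, NP parent -> noun attachment

example = Tree.fromstring(
    "(VP (VBD imposed) (NP (DT a) (JJ gradual) (NN ban)) "
    "(PP (IN on) (NP (NP (RB virtually) (DT all) (NNS uses)) (PP (IN of) (NN asbestos)))))")
for parent, p, n2 in pp_attachments(example):
    print(parent, p, n2)
```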
Data sets – WSJ-sent
• For example, the sentence:
the Environmental Protection Agency imposed a
gradual ban on virtually all uses of asbestos
[Parse tree: (VP imposed (NP a gradual ban) (PP on (NP (NP virtually all uses) (PP of asbestos))))]
Data preprocessing
• Lemmatization
• Named entity extraction
• Replace all digits with a number token
• Convert all text to lowercase
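As an illustration only, the following sketch implements these four preprocessing steps with spaCy standing in for whatever tools were actually used; the entity token names (e.g. "nameperson", which matches a token seen on a later slide) and the digit pattern are assumptions.

```python
# Illustrative preprocessing sketch: lemmatize, replace named entities with
# type tokens, map digits to a number token, and lowercase everything.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def preprocess(text: str) -> list[str]:
    doc = nlp(text)
    tokens = []
    for tok in doc:
        if tok.ent_type_ == "PERSON":
            tokens.append("nameperson")        # entity type token, as in the slides
        elif tok.ent_type_:
            tokens.append("name" + tok.ent_type_.lower())
        elif re.fullmatch(r"\d+([.,]\d+)*", tok.text):
            tokens.append("number")            # generic number token
        else:
            tokens.append(tok.lemma_.lower())  # lemma, lowercased
    return tokens

print(preprocess("GE Capital has a working relationship with L.J. Hooker."))
```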
Our Approach
[Pipeline diagram, as above]
Stanford parser
• Developed by the Stanford Natural Language
Processing Group; widely used and considered
reliable
• Based on probabilistic context-free grammars
(PCFGs); provides both POS annotations and a
scored syntactic parse tree
• Can return the top-K parse trees
Stanford Parser – Majority Vote
• The first parse tree attaches the PP correctly
71.6% of the time
• A majority vote over the first 10 parse trees is
correct 76.6% of the time
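A minimal sketch of this majority vote, assuming the attachment decision ("noun" or "verb") has already been read off each of the top-K parse trees:

```python
# Majority vote over the attachments implied by the top-K parse trees.
from collections import Counter

def majority_vote(attachments: list[str]) -> str:
    """attachments[i] is the attachment implied by the i-th best parse tree."""
    return Counter(attachments).most_common(1)[0][0]

# e.g. 10-best parses, 6 of which attach the PP to the noun
print(majority_vote(["noun", "verb", "noun", "noun", "verb", "noun",
                     "noun", "verb", "noun", "verb"]))   # -> "noun"
```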
Stanford parser
New England Electric System bowed out of the
bidding for Public Service Co. of New Hampshire
[Parse tree diagrams: the parser's first parse tree attaches the PP "for Public Service Co. of New Hampshire" outside the NP "the bidding"; the second parse tree attaches it to the noun, forming the NP "the bidding for Public Service Co. of New Hampshire"]
Our Approach
[Pipeline diagram, as above]
Query generation
We need to obtain statistics for the following
queries:
– Candidate + p
– Candidate + p + n2
– Candidate + n2
– v + n1 + p + n2
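A small sketch of what query generation for one quadruplet could look like; the query names and the representation of queries as token lists are illustrative assumptions:

```python
# Generate the count queries for one <v, n1, p, n2> quadruplet.
def generate_queries(v: str, n1: str, p: str, n2: str) -> dict[str, list[str]]:
    queries = {}
    for name, cand in (("verb", v), ("noun", n1)):
        queries[f"{name}_p"] = [cand, p]           # candidate + p
        queries[f"{name}_p_n2"] = [cand, p, n2]    # candidate + p + n2
        queries[f"{name}_n2"] = [cand, n2]         # candidate + n2
    queries["v_n1_p_n2"] = [v, n1, p, n2]          # the full quadruplet
    return queries

print(generate_queries("join", "board", "as", "director"))
```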
Query expansion
• The preprocessing of the query and of the BNC
increases the hit count of the query, but it is
not enough
• We wish to expand our queries with synonyms
and semantically related words.
• <join, board, as, director>
– join => {join, get together, link}
– board => {board, committee, panel}
– director => {director, manager}
Query expansion
• Using WordNet:
– Problem: not all query words appear in WordNet
– How to choose the correct synonym (or related
word)?
Query expansion
Compile the synonym list for each query word:
• Sense ranking (when possible): using the Lesk
disambiguation algorithm, we choose the top k senses
(k = 5 synsets for each word) and expand them with
their hyponyms.
• Related-candidate ranking: for each synonym/related
word, we calculate the cosine similarity between the
mutual-information vector representations of the
original word and the related candidate. As in the
sense selection, we take the top k related words (k = 10).
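Below is a hedged sketch of this expansion step using NLTK's WordNet and Lesk implementations; the k values follow the slide, but the sense ranking and the final re-ranking (which in the thesis uses mutual-information vectors) are simplified stand-ins:

```python
# Simplified query-expansion sketch: Lesk-preferred sense first, hyponym
# expansion, then a crude truncation in place of the similarity re-ranking.
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

def expand_word(word: str, context: list[str], k_senses: int = 5, k_related: int = 10):
    synsets = wn.synsets(word)
    if not synsets:
        return [word]                      # not all query words appear in WordNet
    best = lesk(context, word)             # one Lesk-preferred sense
    ranked = ([best] if best else []) + [s for s in synsets if s != best]
    related = []
    for sense in ranked[:k_senses]:
        for syn in [sense] + sense.hyponyms():   # expand senses with their hyponyms
            related.extend(l.replace("_", " ") for l in syn.lemma_names())
    # In the thesis, candidates are re-ranked by cosine similarity of
    # mutual-information vectors; here we just deduplicate and truncate.
    seen, out = set(), []
    for w in related:
        if w != word and w not in seen:
            seen.add(w)
            out.append(w)
    return [word] + out[:k_related]

print(expand_word("board", "join the board as a director".split()))
```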
Our Approach
[Pipeline diagram, as above]
BNC
• The BNC (British National Corpus) is a 100-million-word
collection of samples of written and spoken language
from a wide range of sources, designed to represent a
wide cross-section of British English from the latter
part of the 20th century
• It is provided with a POS annotation and the lemma of
each word
Obtaining statistics from BNC
• We wish to check how often each candidate is
attached to the preposition in the BNC
• Since we do not have a gold standard, the
classification of each example must rely on
heuristics
Obtaining statistics from BNC
• We split all possible examples into two
categories:
– Unambiguous case: candidate + PP must be
attached in any correct parse tree
– Ambiguous case: candidate + PP may be attached,
but there is another possible attachment
candidate
Obtaining statistics from BNC
Unambiguous cases detection heuristics:
– n1_p:
• no verb before n1 within 5 words
• no noun, verb or preposition between n1 and p
– v_p :
• no noun, verb or preposition between v and p
– p_n2:
• no noun, verb or preposition between p and n2
– n1_p_n2 and v_p_n2:
• the mix of n1_p and p_n2, or v_p and p_n2, respectively
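For illustration, a rough sketch of the n1_p unambiguous-case check over a coarsely POS-tagged sentence; the tag names and helper signature are assumptions, not the thesis implementation:

```python
# Unambiguous n1_p heuristic: no verb within 5 words before n1, and no noun,
# verb or preposition between n1 and p.
def is_unambiguous_n1_p(tags: list[str], i_n1: int, i_p: int) -> bool:
    """tags[i] is a coarse POS tag ('NOUN', 'VERB', 'PREP', ...)."""
    if any(t == "VERB" for t in tags[max(0, i_n1 - 5):i_n1]):
        return False
    between = tags[i_n1 + 1:i_p]
    return not any(t in ("NOUN", "VERB", "PREP") for t in between)

# "... under the control of the Home Secretary ..." -> <control, of> is unambiguous
tags = ["PREP", "DET", "NOUN", "PREP", "DET", "NOUN", "NOUN"]
print(is_unambiguous_n1_p(tags, i_n1=2, i_p=3))   # True
```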
Obtaining statistics from BNC
Examples:
<control, of> (n1_p):
• First , though MI5 is notionally under the control of the Home
Secretary he will be told nothing about its day-to-day operations.
<charge, with> (v_p):
• As the direct result of Gouzenko 's information 21 people were
charged with various offences of whom 11 were convicted , two
had charges withdrawn , and eight were acquitted.
<respond, in, way> (v_p_n2):
• Latin American governments have responded in different ways ,
some taking a generally sympathetic approach through assisting
self-help schemes , as in the Peruvian case , but even these may
resort to repression .
Obtaining statistics from BNC
Ambiguous cases detection heuristics:
– n1_p:
• a verb before n1, within 5 words.
• no noun or verb between n1 and p
– v_p:
• a noun between v and p,
• no verb or preposition between v and p
– n1_p_n2 and v_p_n2:
• the mix of n1_p and p_n2, or v_p and p_n2, respectively
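And a companion sketch for the ambiguous n1_p case, under the same assumptions as the earlier heuristic sketch:

```python
# Ambiguous n1_p heuristic: a verb appears within 5 words before n1 (a
# competing attachment site), but no noun or verb blocks the reading
# between n1 and p.
def is_ambiguous_n1_p(tags: list[str], i_n1: int, i_p: int) -> bool:
    has_verb_before = any(t == "VERB" for t in tags[max(0, i_n1 - 5):i_n1])
    clean_between = not any(t in ("NOUN", "VERB") for t in tags[i_n1 + 1:i_p])
    return has_verb_before and clean_between

# "... charged reporters with ..." -> "charged" competes with "reporters" for "with"
tags = ["VERB", "NOUN", "PREP", "VERB"]
print(is_ambiguous_n1_p(tags, i_n1=1, i_p=2))   # True
```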
Obtaining statistics from BNC
Examples:
<charge, with> (v_p):
• When asked how his release had been arranged they both
charged reporters with being irresponsible and endangering
the lives of the remaining hostages.
<power, of> (n1_p):
• we must act with moderation in order to avoid increasing the
powers of the medicines to an undue extent by such
trituration .
<order, in, way> (v_p_n2):
• A military commander should order his troops in the way best
calculated to achieve victory at a minimal cost .
Our Approach
[Pipeline diagram, as above]
Combining ambiguous and
unambiguous counts
• For each ambiguous example, we parse the
sentence it was detected in
• Since the Stanford parser is accurate in most cases
(more than 70%), we can use its vote
to disambiguate the ambiguous counts.
• We calculate the average over up to K such
sentences
Adjusted counts
$c^{*}(x) = c(x) + \sum_{i \in e(x)} c(i)$
where $e(x)$ is the set of all expansions of $x$, and $c(x)$ is the unambiguous count only.
$c_{amb}(x) = c(x) \cdot att\_rate(x) + \sum_{i \in e(x)} c(i) \cdot att\_rate(i)$
$c_{total}(x) = c^{*}(x) + c_{amb}(x)$
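A small numeric sketch of these formulas, assuming the unambiguous counts, ambiguous counts and parser-based attachment rates have already been collected (the toy numbers are made up):

```python
# Adjusted-count computation following the formulas above.
def adjusted_counts(x, unamb, amb, att_rate, expansions):
    """unamb[w]/amb[w]: unambiguous/ambiguous counts; att_rate[w]: parser-based
    attachment rate for the ambiguous sentences; expansions[x]: e(x)."""
    c_star = unamb[x] + sum(unamb[i] for i in expansions[x])
    c_amb = amb[x] * att_rate[x] + sum(amb[i] * att_rate[i] for i in expansions[x])
    return c_star + c_amb   # c_total(x)

print(adjusted_counts(
    "board",
    unamb={"board": 12, "committee": 4, "panel": 2},
    amb={"board": 30, "committee": 9, "panel": 5},
    att_rate={"board": 0.7, "committee": 0.6, "panel": 0.5},
    expansions={"board": ["committee", "panel"]},
))   # -> 46.9
```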
More Sentence Analysis
Intuition, for the quadruplet <join, board, as, director>:
• If the PP should be attached to the verb, then in ambiguous
examples of the form “join * as director”, * can be
many different nouns, unrelated to “board”
• If the attachment is to the noun, then “board as director” is a
noun phrase, and can therefore appear with different
head verbs, not only verbs that are related to “join”
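One way to operationalize this intuition (an illustrative sketch, not the thesis feature definition) is to count how many distinct fillers occur in the wildcard slot of the pattern:

```python
# Count the number of distinct fillers in the wildcard position of the pattern.
from collections import Counter

def filler_diversity(matches: list[tuple[str, str, str, str]], slot: int) -> int:
    """matches are (v, n1, p, n2) tuples found in the corpus; slot is the index
    of the wildcard position (1 for 'join * as director')."""
    return len(Counter(m[slot] for m in matches))

matches = [("join", "company", "as", "director"),
           ("join", "firm", "as", "director"),
           ("join", "team", "as", "director")]
print(filler_diversity(matches, slot=1))   # many unrelated nouns -> a verb-attachment signal
```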
Our Approach
sentence
v, n1, p, n2
syntactic parser
context analyzer
generate
queries
counts
search in a large
corpus
matching
sentences
Classifier
sentences
analyzer
Context
• For each sentence we consider a window of k
adjacent sentences
– We search for candidate + n2 within those
sentences, and calculate the distance between
them
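An illustrative sketch of such a context feature; treating the distance as a sentence offset within the window is an assumption here:

```python
# Look for "candidate" and "n2" co-occurring within a window of k adjacent
# sentences and record the smallest sentence distance.
def context_distance(sentences: list[list[str]], idx: int, candidate: str,
                     n2: str, k: int = 3):
    """Return the smallest sentence distance (within +/- k of sentence idx) at
    which both candidate and n2 occur, or None if they never co-occur."""
    best = None
    for j in range(max(0, idx - k), min(len(sentences), idx + k + 1)):
        if candidate in sentences[j] and n2 in sentences[j]:
            d = abs(j - idx)
            best = d if best is None else min(best, d)
    return best

docs = [["they", "joined", "the", "board"],
        ["the", "board", "as", "director", "met"],
        ["nothing", "here"]]
print(context_distance(docs, idx=0, candidate="board", n2="director"))  # -> 1
```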
Additional Features
• FrameNet realizations
• The presence of a possessive pronoun or a
determiner is likely to indicate verb
attachment
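A tiny sketch of this feature; the slide does not say whether the check applies before n1 or n2, so the position used below is an assumption:

```python
# Check whether the word preceding n2 is a possessive pronoun or a determiner.
POSSESSIVES = {"my", "your", "his", "her", "its", "our", "their"}
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}

def has_det_or_poss_before(tokens: list[str], i_n2: int) -> bool:
    return i_n2 > 0 and tokens[i_n2 - 1].lower() in POSSESSIVES | DETERMINERS

tokens = ["injures", "farmer", "with", "his", "axe"]
print(has_det_or_poss_before(tokens, i_n2=4))   # True -> hints at verb attachment
```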
Our Approach
[Pipeline diagram, as above]
Back Off Model
Machine Learning
• We use SVM with a polynomial kernel
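A minimal sketch of the classifier stage, with scikit-learn's SVC as an illustrative stand-in for the SVM actually used; the toy feature vectors are made up:

```python
# SVM with a polynomial kernel over the extracted feature vectors.
from sklearn.svm import SVC

# toy feature vectors (e.g. count-based and context features) and labels:
# 1 = verb attachment, 0 = noun attachment
X_train = [[3.0, 0.5, 1.0], [0.2, 4.0, 0.0], [2.5, 0.3, 1.0], [0.1, 3.5, 0.0]]
y_train = [1, 0, 1, 0]

clf = SVC(kernel="poly", degree=3)
clf.fit(X_train, y_train)
print(clf.predict([[2.8, 0.4, 1.0]]))   # e.g. [1], i.e. verb attachment
```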
Results
• RRR Data set
Description: Accuracy (%)
Most likely attachment for each preposition: 72.2
Maximum entropy, words & classes (Ratnaparkhi et al., 1994): 81.2
Back-off (Collins and Brooks, 1995): 84.5
Nearest-neighbor (Zhao and Lin, 2004): 86.5
Back-off – counts on untagged corpus (*): 81.4
Back-off – counts with expansion on untagged corpus (*): 81.1
SVM using quadruplet features only (*): 80.9
Results
• WSJ-sent
Description: Accuracy (%)
Collins and Brooks back-off: 85.1
Back-off model – counts on untagged corpus (*): 82.3
Back-off model – counts with expansion on untagged corpus (*): 82.1
SVM – quadruplet features only (*): 79.9
SVM – quadruplet and sentence features (*): 83.3
SVM – quadruplet, sentence and context features (*): 82.5
Counts are not always accurate
GE Capital has a working relationship with L.J. Hooker .
Wrong attachment: <had, with, nameperson>:
“… he suddenly saw a solution to a theological argument
which he had with nameperson …”
Conclusions
• context – does not help in most cases, and adds a
lot of noise
• counts in a large corpus – a lot of potential, but the
heuristics need to be improved
• Further work:
– Word2Vec for related-term extraction
– Expand the usage of FrameNet – for each v + p query,
check how well n2 matches the semantic role in the
frame
– Extract more information from the sentence –
adjectives modifying the nouns
Questions
Thank you!