Learning Random Walk Models for Inducing Word Dependency Distributions

Kristina Toutanova
Christopher D. Manning
Andrew Y. Ng
Word-specific information is important for many NLP disambiguation tasks

- Language modeling for speech recognition:
  "I would like to make a collect ... ?"  (coal? cow? call?)
- We need good estimates of P(coal | collect), P(cow | collect), and P(call | collect).
Sparseness: 1 million words is very little data

- The number of parameters to be estimated is very large: for a vocabulary of 30,000 words there are 30,000 x 30,000 = 900 million pair-wise parameters.
- Pair-wise statistics involving two words are very sparse, even on topics central to the domain of the corpus. Examples from WSJ (1 million words), counted as sketched below:
  - stocks plummeted: 2 occurrences
  - stocks stabilized: 1 occurrence
  - stocks rose: 50 occurrences
  - stocks skyrocketed: 0 occurrences
  - stocks laughed: 0 occurrences
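As a rough illustration of how such pair counts can be gathered, here is a minimal sketch; the corpus file name and the simple adjacent-bigram notion of a "pair" are assumptions for illustration, not part of the slides.

# Minimal sketch: count word pairs in a tokenized corpus to see how sparse
# pair-wise statistics are. The file name and the adjacent-bigram pairing
# are illustrative assumptions.
from collections import Counter

pair_counts = Counter()
with open("wsj_tokens.txt") as f:              # hypothetical one-sentence-per-line file
    for line in f:
        tokens = line.lower().split()
        pair_counts.update(zip(tokens, tokens[1:]))   # adjacent word pairs

for pair in [("stocks", "plummeted"), ("stocks", "rose"), ("stocks", "skyrocketed")]:
    print(pair, pair_counts[pair])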
It must be possible to use related words instead

- The hypothesis is that similar words (according to various measures) have similar distributions. There are many ways of defining related words:
  - morphology (e.g. stabilize, stabilized, stabilizing)
  - synonyms (e.g. rise, climb, skyrocket)
  - "is-a" relationships
  - similarity with respect to some features or distributions extracted from a corpus (examples later)
  - co-occurrence in a huge corpus (the Web)

[Figure: examples of related words linked by morphology, synonymy, and "is-a" relationships]
Using multiple similarity measures and chaining inferences

[Figure: chaining inferences through related words, e.g. stocks rose -> rise -> skyrocket -> skyrocketed]
Random walks in the space of words

- We propose to use a Markov Chain model in the state space of words, connected by different links according to their relatedness.
- The target word dependency distribution is estimated as the stationary distribution of the Markov Chain (a toy sketch follows below).
- This allows us to make inferences based on different relatedness concepts and to chain inferences together.
- We automatically learn weights for the different information sources from data.
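Before the details, here is a toy numeric sketch of the core idea, not taken from the slides: treat words as states, iterate a transition matrix, and read a smoothed distribution off the resulting state probabilities. The three-word vocabulary and the transition matrix are invented; the actual model truncates and weights the walk as described on later slides.

# Minimal sketch: a random walk over word states, iterated toward its
# stationary distribution. All numbers are toy values.
import numpy as np

words = ["peg", "pegs", "rivet"]
T = np.array([[0.2, 0.5, 0.3],     # toy transition probabilities between word states
              [0.6, 0.2, 0.2],
              [0.4, 0.3, 0.3]])

p = np.array([1.0, 0.0, 0.0])      # start with all mass on "peg"
for _ in range(50):                # iterate until (approximately) stationary
    p = p @ T
print(dict(zip(words, p.round(3))))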
Related work: PageRank

- PageRank is a Markov Chain with transition probabilities given by the Web graph and a manually varied probability ε of resetting to the uniform distribution:
  - with probability 1 - ε, choose a link uniformly at random from the out-links of the current page
  - with probability ε, pick a web page from the entire collection uniformly at random
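For concreteness, here is a minimal simulation sketch of this reset-based walk; the tiny link graph and ε = 0.15 are invented values, not from the slides.

# Minimal sketch of a PageRank-style walk: with probability 1 - eps follow a
# random out-link, with probability eps jump to a uniformly chosen page.
import random

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # toy web graph
pages = list(links)
eps = 0.15

def step(page):
    if random.random() < eps or not links[page]:
        return random.choice(pages)       # reset: uniform over all pages
    return random.choice(links[page])     # follow an out-link uniformly at random

visits = {p: 0 for p in pages}
page = "A"
for _ in range(100_000):
    page = step(page)
    visits[page] += 1
print({p: round(v / 100_000, 3) for p, v in visits.items()})   # approximate PageRank scores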
This work

- In our work, the Markov Chain transition matrix is defined using different "link types".
- The probability of following each link type is learned from data.
- This allows for a richer class of models incorporating multiple knowledge sources.
Word dependency probabilities for structural ambiguities

- Structural ambiguities:
  - He broke [the window] [with a hammer]
  - He broke [the window] [with the white curtains]
- Good probability estimates of P(hammer | broke, with) and P(curtains | window, with) will help with disambiguation.
The problem: prepositional phrase attachment

- Attachment of a phrase headed by a preposition such as with, of, in, for, etc.
- A binary decision: attach to the preceding noun phrase or to the verb phrase.
- We represent examples using only the head words (the main words of each phrase):
  She [broke] [the window] [with [a hammer]]
  V=broke, N1=window, P=with, N2=hammer
- The class variable is the attachment: Att=va (verb attachment) or Att=na (noun attachment).
- We learn a generative model P(V, N1, P, N2, Att); a sketch of the resulting decision rule follows below.
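Here is the promised sketch of the attachment decision under such a generative model. The probability functions are placeholders standing in for estimates (e.g. the random-walk-smoothed distributions), and the factorization mirrors the conditional distributions listed on the next slide; the toy numbers exist only so the sketch runs.

# Minimal sketch of the attachment decision with a generative model over (V, N1, P, N2, Att).
def score(att, v, n1, p, n2, P):
    s = P["att"](att) * P["p"](p, att) * P["v"](v, att, p) * P["n1"](n1, att, p, v)
    # under verb attachment n2 is conditioned on the verb, under noun attachment on n1
    return s * P["n2"](n2, att, p, v if att == "va" else n1)

def attach(v, n1, p, n2, P):
    return max(("va", "na"), key=lambda att: score(att, v, n1, p, n2, P))

# Toy constant placeholders so the sketch runs end to end.
P = {
    "att": lambda att: 0.5,
    "p":   lambda p, att: 0.1,
    "v":   lambda v, att, p: 0.01,
    "n1":  lambda n1, att, p, v: 0.01,
    "n2":  lambda n2, att, p, head: 0.02 if att == "va" else 0.01,
}
print(attach("broke", "window", "with", "hammer", P))   # -> "va" with these toy numbers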
Sparseness of distributions

  Factor                 % Non-Zero
  Verb attachment
    P(p | va)              99.8
    P(v | va, p)           64.8
    P(n1 | va, p, v)       15.7
    P(n2 | va, p, v)       13.8
  Noun attachment
    P(p | na)              99.5
    P(v | na, p)           69.4
    P(n1 | na, p, v)       20.9
    P(n2 | na, p, n1)      17.4

Examples: returning rates to normal (va); pushing rate of inflation (na).

We will learn Markov Chains whose stationary distributions are good estimates of these distributions.
Basic form of learned Markov Chains

Example: v:hang  n1:painting  p:with  n2:peg
P(n2=peg | v=hang, p=with, va) = P_with,va(n2=peg | v=hang)

- States are colored to show whether the word is a head or a dependent, e.g. hang is a head and peg is a dependent in a with-type dependency.
- Links among states represent different kinds of relatedness.
- The initial state distribution places probability 1 on hang.

[Figure: states for hang, hook, fasten, hooks]
An example Markov Chain: P_with(n2:peg | v:hang)

- We include the empirical P(head | dependent) and P(dependent | head) distributions as link types (the empirical distribution links in the figure).

[Figure: states hang, peg, pegs, fasten, rivet, hook, hooks connected by empirical distribution links]
An example Markov Chain: P_with(n2:peg | v:hang)

- We have seen pegs as a dependent of hang in the training corpus.
- peg and pegs have the same root form.
- So the probability of peg given hang will again be high.

[Figure: walk hang -> pegs -> peg]
An example Markov Chain: P_with(n2:peg | v:hang)

- The empirical probability of rivet given hang is high.
- rivet and peg are semantically related.
- So the probability of peg given hang will again be high.

[Figure: walk hang -> rivet -> peg]
An example Markov Chain: P_with(n2:peg | v:hang)

- The probability of peg given hang will again be high.

[Figure: states hang, peg, pegs, fasten, rivet, hook, hooks]
Form of learned transition distributions

- A list of link types is given.
- Each link type is a transition distribution from states to states, given by a matrix T_i.
- The transition probabilities of the learned Markov Chain are given by:
  P(w', c' | w, c) = Σ_{i=1..k} λ(w, c, i) · T_i(w', c' | w, c)
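As an illustration of the formula above, here is a minimal sketch that mixes two toy link-type matrices into a single transition matrix. For simplicity the weights here are shared across all states, whereas in the model λ(w, c, i) may depend on the source state; the matrices and weights are invented values.

# Minimal sketch: combine link-type transition matrices T_i with weights into
# one transition matrix. All numbers are toy values.
import numpy as np

T = [
    np.array([[0.9, 0.1], [0.2, 0.8]]),   # e.g. an empirical-distribution link
    np.array([[0.5, 0.5], [0.5, 0.5]]),   # e.g. a back-off / uniform link
]
lam = np.array([0.7, 0.3])                # link-type weights, summing to 1

P_trans = sum(l * Ti for l, Ti in zip(lam, T))
print(P_trans)                            # rows still sum to 1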
Training the model

- The model is trained by maximizing the conditional log-likelihood of the samples in a development set.
- For computational reasons, we fix the maximum number of steps in the Markov Chain to a small number d, called the maximum degree of the walk:

  π(s) = Σ_{t=0..d} γ (1 - γ)^t P(S_t = s)

- The stopping probability γ is another trainable parameter of the Markov Chain.
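A minimal sketch of the truncated-walk distribution above. The symbol γ for the stopping probability follows the reconstruction of the slide's formula, and the transition matrix, start state, and numeric values are invented for illustration.

# Minimal sketch: run the chain for at most d steps and mix the step-t state
# distributions with weights gamma * (1 - gamma)**t. All numbers are toy values.
import numpy as np

P_trans = np.array([[0.8, 0.2], [0.3, 0.7]])
p0 = np.array([1.0, 0.0])        # initial distribution: all mass on state 0
gamma, d = 0.3, 3                # stopping probability and maximum degree

pi = np.zeros_like(p0)
p_t = p0.copy()
for t in range(d + 1):
    pi += gamma * (1 - gamma) ** t * p_t
    p_t = p_t @ P_trans
print(pi, pi.sum())              # note: sums to 1 - (1 - gamma)**(d + 1), not exactly 1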
Evaluation data

- Four-tuples of words and the correct attachment, extracted from the WSJ part of the Penn Treebank.
- 20,801 training samples, 4,039 development samples, and 3,097 test samples.
- Average human accuracy is 88.2% given only the four words (Ratnaparkhi 94):
  - many cases where the meaning is pretty much the same either way, e.g. "They built a house in London"
  - cases where it is difficult to decide from the head words only (93% human accuracy if the whole sentence is seen)
- Simple baselines: 59% accuracy for always choosing noun attachment, 72.2% for choosing the most common attachment for each preposition (see the sketch below).
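Here is the promised sketch of the "most common attachment for each preposition" baseline. The tiny training list and the default choice for unseen prepositions are invented for illustration; real data would be the (V, N1, P, N2, Att) tuples from the treebank.

# Minimal sketch of the per-preposition majority baseline.
from collections import Counter, defaultdict

train = [("broke", "window", "with", "hammer", "va"),
         ("pushed", "rate", "of", "inflation", "na"),
         ("ate", "pizza", "with", "fork", "va")]

counts = defaultdict(Counter)
for v, n1, p, n2, att in train:
    counts[p][att] += 1

def predict(p):
    # fall back to noun attachment for unseen prepositions (an assumed default)
    return counts[p].most_common(1)[0][0] if counts[p] else "na"

print(predict("with"), predict("of"), predict("for"))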
Baseline MC model

- It includes only empirical distributions from the training corpus: links from head states to dependent states with the following transition probabilities:
  - P^(n2 | va, p, v)                        empirical distribution link
  - P^(n2 | va, p)                           back-off link
  - P^(n2 | va)                              back-off link
  - P^(n2)                                   back-off link
  - 1 / |V| (uniform over the vocabulary)    back-off link
- A random walk with these links is the same as the familiar linear interpolation (see the sketch below).
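To see why a one-step walk over these links reduces to linear interpolation, here is a minimal numeric sketch; the weights and the back-off probabilities for a particular word are invented toy values.

# Minimal sketch: the walk picks one back-off distribution with probability
# lambda_i and emits a word from it, which is exactly linear interpolation.
lambdas = [0.4, 0.3, 0.2, 0.08, 0.02]   # toy weights for the five links
dists = [                               # toy values of P(n2 = "peg") under each level
    0.20,    # P^(n2 | va, p, v)
    0.10,    # P^(n2 | va, p)
    0.05,    # P^(n2 | va)
    0.01,    # P^(n2)
    1e-5,    # 1 / |V|, uniform over the vocabulary
]
p_peg = sum(l * d for l, d in zip(lambdas, dists))
print(p_peg)                            # the familiar linearly interpolated estimate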
Accuracy of Baseline

  Collins & Brooks 95 (1 NN)         84.15%
  Abney et al. 99 (Boosting)         84.40%
  Vanschoenwinkel et al. 03 (SVM)    84.80%
  Baseline                           85.85%
  Human                              88.20%
Adding morphology and WordNet synonyms

- Morphology links for verbs and nouns: the probability of transitioning to a word with the same root form is proportional to the smoothed empirical count of that word (e.g. hang ↔ hangs, nail ↔ nails).
- Synonym links for verbs and nouns: uniform probability of transitioning to any synonym with respect to the top three senses of the word in WordNet (e.g. fasten, nail, rivet).

A sketch of building such links is shown below.
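The sketch assumes NLTK with the WordNet corpus data installed; the top-three-senses rule follows the slide, while the smoothing constant alpha and the helper names are invented placeholders.

# Minimal sketch of synonym links from WordNet and count-proportional morphology links.
from collections import Counter
from nltk.corpus import wordnet as wn   # requires the NLTK WordNet data to be downloaded

def synonym_links(word, pos=wn.VERB):
    syns = set()
    for synset in wn.synsets(word, pos=pos)[:3]:      # top three senses
        for lemma in synset.lemma_names():
            if lemma != word and "_" not in lemma:
                syns.add(lemma)
    n = len(syns)
    return {s: 1.0 / n for s in syns} if n else {}    # uniform over synonyms

def morphology_links(word, same_root_words, counts, alpha=0.5):
    # transition probability proportional to the smoothed empirical count
    weights = {w: counts[w] + alpha for w in same_root_words}
    total = sum(weights.values())
    return {w: c / total for w, c in weights.items()}

print(synonym_links("fasten"))
print(morphology_links("hang", ["hangs", "hung"], Counter({"hangs": 3, "hung": 7})))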
Accuracy using morphology and synonyms

  Model (maximum degree d = 3)     Accuracy
  Baseline                         85.86
  BL + verb stem + noun stem       86.18
  BL + stem + synonyms             86.53
  Human                            88.20

The error reduction from the baseline with respect to human accuracy is 28.6%, but it is significant only at level 0.1.
Using unannotated data

- Unannotated data is potentially useful for extracting information about words.
- We use the BLLIP corpus, consisting of about 30 million words of automatically parsed text (Charniak 00).
- We extract 567,582 noisy examples of the same form as the training set (called BLLIP-PP).
- We use this data in two ways:
  - create links using the empirical distributions in BLLIP-PP
  - create distributional similarity links
Distributional similarity links from BLLIP-PP

- Distributional similarity of heads with respect to dependents and vice versa (Dagan 99, Lee 99), based on the Jensen-Shannon divergence of the empirical distributions:

  Let Q_with,va(hang)   = P^(N2 | va, with, hang)
      Q_with,va(fasten) = P^(N2 | va, with, fasten)

  Sim(hang, fasten) = exp(-JS(Q(hang), Q(fasten)))

  JS(Q1, Q2) = 1/2 { D(Q1 || avg(Q1, Q2)) + D(Q2 || avg(Q1, Q2)) }

- Distributional similarity of heads with respect to the distribution over another variable, not involved in the dependency relation of interest.

A sketch of this computation follows below.
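The promised sketch of the similarity computation: Jensen-Shannon divergence between two empirical dependent distributions, mapped through exp(-JS). The two toy distributions for hang and fasten are invented.

# Minimal sketch of the JS-divergence-based similarity.
import math

def kl(p, q):
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

def js(p, q):
    keys = set(p) | set(q)
    avg = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in keys}
    return 0.5 * (kl({w: p.get(w, 0.0) for w in keys}, avg)
                  + kl({w: q.get(w, 0.0) for w in keys}, avg))

Q_hang   = {"peg": 0.5, "hook": 0.3, "nail": 0.2}     # toy P^(N2 | va, with, hang)
Q_fasten = {"peg": 0.4, "hook": 0.2, "rivet": 0.4}    # toy P^(N2 | va, with, fasten)
print(math.exp(-js(Q_hang, Q_fasten)))                # similarity in (0, 1]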
Accuracy using BLLIP-PP

  Model (maximum degree d = 3)     Accuracy
  Baseline                         85.86
  BL + morph + synonyms            86.53
  BL + verb stem + BLLIP links     87.54
  Human                            88.20

The error reduction from the baseline is 12% (72% with respect to human accuracy).
Summary

- We built Markov Chain models defined using a base set of links.
- This allowed inference based on different relatedness concepts and chaining inferences together.
- We automatically learned weights for the different information sources.
- Large improvements were possible on the prepositional phrase attachment task.
- The methods may be applicable to learning random walks on the web (cf. ...).