First-Pass Attachment Disambiguation with Recursive Neural Networks


Recursive Neural Networks and connectionist systems for NLP
The First-Pass Attachment Disambiguation problem
Fabrizio Costa
University of Florence
Overview
• What is this talk about:
 brief review of connectionist architectures for NLP
 introduction of a connectionist recursive system for
syntactic parsing
• Hybrid model:
 Dynamic grammar: strong incrementality hypothesis
 Recursive connectionist predictor
• Investigations:
 linguistic preferences expressed by the system
• Advances:
 enhance performance of the proposed model by
information reduction, domain partitioning
Connectionism and NLP
• Even if strongly criticized (Fodor, Pylyshyn. Connectionism and Cognitive Architecture: A Critical Analysis. Cognition, 1988), connectionist systems are applied to a variety of NLP tasks:
 grammaticality prediction
 case role assignment
 syntactic parsers
 multi-feature systems (lexical, syntactic, semantic)
Advantages
• Parallel satisfaction of multiple and diverse
constraints
 in contrast, rule-based sequential processing needs to compute solutions that are later discarded if they do not meet the requirements
• Develop and process distributed
representations for complex structures
Connectionist Architectures
• Brief review of:
 Neural Networks
 Recurrent Networks
 Recursive Auto Associative Memory
• Introduction to Recursive Neural Networks
Neural Networks
• What is an Artificial Neural Network (ANN)?
• A simple model: the perceptron
[Figure: perceptron with inputs x1, x2, x3, …, xn, weights w1, w2, …, wn and output y]
y = Σi=1..n wi xi - θ
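A minimal sketch of this computation (the 0/1 output convention and all names here are my own assumptions, not part of the original slides):

import numpy as np

def perceptron_output(x, w, theta):
    """Threshold unit: output 1 if the weighted sum of the inputs exceeds theta."""
    return 1 if np.dot(w, x) - theta >= 0 else 0

# Example: two inputs, hand-picked weights and threshold.
print(perceptron_output(np.array([1.0, 0.5]), np.array([0.4, 0.6]), theta=0.5))  # -> 1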
What can it learn?
• A perceptron can learn linear decision
boundaries from examples
• Ex: two-dimensional input
[Figure: examples of class C1 and class C2 in the (x1, x2) plane, separated by the line w1 x1 + w2 x2 - θ = 0]
How can it learn?
• Iterative algorithm:
 at every presentation of the input the procedure
reduces the error
[Figure: the decision boundary in the (x1, x2) plane is progressively adjusted over iterations 1, 2 and 3]
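A sketch of this iterative procedure using the standard perceptron update rule (variable names, the learning rate and the AND example are illustrative assumptions):

import numpy as np

def perceptron_train(X, y, epochs=10, lr=0.1):
    """Iteratively adjust the weights so the decision line moves towards mistakes.
    X: (n_samples, n_features) inputs; y: 0/1 class labels."""
    w = np.zeros(X.shape[1])
    theta = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) - theta >= 0 else 0
            error = target - pred           # 0 if correct, +1 or -1 otherwise
            w += lr * error * xi            # rotate/shift the boundary
            theta -= lr * error             # adjust the threshold
    return w, theta

# Example: learning the logical AND (a linearly separable problem)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
w, theta = perceptron_train(X, np.array([0, 0, 0, 1]))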
Getting more complex
• Multilayer Feed-forward Network
 hidden layer
 non-linearity
[Figure: network with an input layer x1 … xn, hidden layers and an output layer]
Signal Flow
• Learning takes place in 2 phases:
 Forward phase (compute prediction)
 Backward phase (propagate error signal)
[Figure: the forward pass propagates the inputs x1 … xn through the network; the backward pass propagates the error signal back]
Decision Boundaries
• Now decision boundaries can be made arbitrarily complex by increasing the number of neurons in the hidden layer.
[Figure: a non-linear decision boundary separating the two classes of examples]
ANN and NLP
• Since ANN can potentially learn any mapping
from input values to any desired output
• ...why not apply them to NLP problems?
• How to code linguistic information to make it
processable by ANN?
• What are the properties of linguistic information?
Neural Networks and NLP
• Problems:
 in the simplest formulation, NLP input consists of a variable-length sequence of tokens
 the output at any time depends on input received an arbitrary number of time-steps in the past
 standard neural networks can directly process only
sequences of fixed size
• Idea:
 introduce an explicit state representation
 recycle this state as input for the next time-step
Recurrent Networks
[Figure: recurrent network with Input Units, Hidden Units and Output Units, plus Holding Units that feed back into the hidden and output layers]
Rumelhart, Hinton, Williams.
Parallel distributed processing:
Explorations in the microstructure of Cognition.
MIT Press 1986
Working Principle
• A complete forward propagation (from input to output) is
1 time step
• the backward connections copy activations to Holding
Units for next time step
• at the first time step all Holding Units have a fixed
conventional value (generally all 0s)
• at a general time step:
 the Hidden Units receive input from both the Input Units and the appropriate Holding Units
 the Output Units receive input from both the Hidden Units and the appropriate Holding Units
Simple Recurrent Networks
• J. L. Elman proposed and studied the properties of Simple Recurrent Networks applied to linguistic problems
 Elman. Finding structure in time. Cognitive Science, 1990
 Elman. Distributed Representations, Simple Recurrent Networks and
Grammatical Structure. Machine Learning, 1991
• Simple Recurrent Networks are Recurrent
networks that
 have a single set of holding units for the hidden layer
called context layer
 backpropagation of the error signal is truncated at the context layer
Simple Recurrent Networks
[Figure: Simple Recurrent Network with Input Units, Hidden Units, Output Units and a Context Layer holding a copy of the previous hidden activations]
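A minimal sketch of one SRN time step (the weight names, the tanh squashing and the softmax output are my own choices for illustration; Elman's original network used logistic units):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def srn_step(x_t, context, W_in, W_ctx, W_out):
    """One time step: the hidden layer sees the current input plus the
    context layer (a copy of the previous hidden activations)."""
    hidden = np.tanh(W_in @ x_t + W_ctx @ context)
    output = softmax(W_out @ hidden)     # e.g. a distribution over the next token
    return hidden, output                # 'hidden' becomes the next context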
Elman Experiment
• Task: predict next token in a sequence
representing a sentence
• Claim: in order to exhibit good prediction
performance the system has to
 learn syntactic constraints
 learn to preserve relevant information of previous
inputs in the Context Layer
 learn to discard irrelevant information
Elman Setting
• Simplified grammar capable of generating
embedded relative clauses
 23 words plus the end of sentence marker ‘.’
 number agreement
 verb argument structure
 interaction with relative clause
• the agent or subject in a RC is omitted
 center embedding
 viable sentences
• the end of sentence marker cannot occur at all positions in
the sentence
Network and Training
• Local representations for input and output
• Hidden layer with 70 units
• Training set: 4 sets of 10,000 sentences
 sets increasingly “difficult”, with a higher percentage
of sentences containing RCs
 5 epochs
Results
• On sentences from the training set:
 the network has learned to predict verb agreement in
number with the noun and all the other constraints
present in the grammar
• Good generalization on other sentences not
present in the training set
Criticisms
• Possibility of rote learning
 there are 23 vocabulary symbols but only 10 different classes (i.e. ‘boy’, ‘dog’, ‘girl’ are equivalent)
 the number of distinct sentence patterns is small (there are 400 different sentences of 3 words but only 18 different patterns if we consider the equivalences)
 there are very few sentence patterns in the test set that are not in the training set
Criticisms
• The simplified error backpropagation algorithm doesn’t allow learning of long-term dependencies
• the backpropagation of the error signal is truncated at the context layer.
• This makes computation simpler (no need to store the history of activations), and local in time
• The calculated gradient forces the network to transfer information about the input into the hidden layer only if that information is useful for the current output; if it is useful more than one time step in the future, there is no guarantee that it will be preserved
Processing Linguistic data
• Linguistic data such as syntactic information is “naturally” represented in a structured way
[Figure: the phrase “The servant of the actress” shown both as flat information (the token sequence) and as syntactic information (an NP tree: D “The”, N “servant”, and a PP with P “of” and the NP “the actress”)]
Processing structured data
• It is possible to “flatten” the information and transform it into a vector representation
• BUT
• flattening pushes dependencies “further” apart, making them more difficult to process
[Figure: the tree with root A and children B and C, where B dominates D E F G H; in the flattened string (A(B(DEFGH)C)), A and C end up far apart]
• We need to directly process structured information!!
How to process structured information with a
connectionist architecture?
• Idea: recursively compress sub trees into
distributed vector representations
• Pollack. Recursive distributed representations.
Artificial Intelligence, 1990
• Autoencoder net to learn how to compress the
fields of a node to a label and uncompress the
label to the fields
Recursive Auto Associative Memory
RAAM
[Figure: RAAM autoencoder network that compresses the (left, right) fields into a whole representation and decompresses the whole back into (left, right)]
Example
A((BC)D)
[Figure: recursive encoding of the tree A((BC)D): (B C) is compressed into r, (r D) into q, and (A q) into p]
Training
• It is a moving-target training
 when learning (B C) → r → (B C), the representation of r changes
 this changes the target for another example (r D) → q → (r D)
• this can cause instability, so the learning rate has to be kept small and hence we have very long training times
Coding and Decoding
• A RAAM can be viewed as 2 nets trained
simultaneously:
 an encoding network that learns the reduced representation r for
BC
 a decoding network that takes as input the reduced representation r and decodes it to BC
• Both encoding and decoding are recursive operations
• The encoding net knows, from the structure of the tree,
how many times it must recursively compress
representations
• The decoding network has to decide if a decoded field is
a terminal node or an internal node which should be
further decoded
Decoding
• Pollack solution:
 use binary codes for terminal nodes
• internal representations have codes with values
in [0,1] but not close to 0 or 1
• Decision rule:
 if a decoded field has all its values close to 0 or 1
then it is a terminal node and it is not decoded further
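A sketch of the RAAM encode/decode machinery and of Pollack's stopping rule (layer sizes, names and the tolerance are assumptions of this sketch; in the real model the encoder and decoder are the two halves of one autoencoder trained so that the decoded fields reproduce the encoded ones):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def raam_encode(left, right, W_enc, b_enc):
    """Compress the two child fields into one reduced representation ('whole')."""
    return sigmoid(W_enc @ np.concatenate([left, right]) + b_enc)

def raam_decode(whole, W_dec, b_dec):
    """Expand a reduced representation back into its (left, right) fields."""
    fields = sigmoid(W_dec @ whole + b_dec)
    d = whole.shape[0]            # assumes child fields have the same width as 'whole'
    return fields[:d], fields[d:]

def is_terminal(field, tol=0.2):
    """Decision rule: near-binary values mean a terminal node, so stop decoding."""
    return bool(np.all((field < tol) | (field > 1.0 - tol)))

Because the decoder's targets are the encoder's own inputs, the reduced representations are moving targets during training, as discussed in the Training slide above.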
Experiment
• Task: coding and decoding of compositional propositions
• Ex:
 Pat thought that John loved Mary
 (thought Pat (loved John (Mary)))
Experimental Setting
• The network has:
 48 input units
 16 hidden units
 48 output units
• Training set: 13 complex propositions
Results
• After training the network is able to code and
decode some (not all) novel propositions
• Cluster analysis shows that similar trees are
encoded with similar codes
 Ex: (loved John Pat) and (loved Pat Mary) are more
similar to each other than any of the codes for other
trees
Further results
• Pollack tried to test if a network could be trained to manipulate representations without decoding them
• Ex: train a network to transform the reduced representation for (is liked X Y) into (likes Y X)
• Results:
 if trained on 12 of the 16 possible propositions
 the network generalizes correctly to the other 4 propositions
• Chalmers experimented with a network that transforms reduced representations from passive to active structures
Problems
• Generalization problems for new structures
• Very long training times
• Information storage limits:
 all configurations have to be stored in fixed-size
vector representations
 as the height grows we have an exponential number of possible trees
 the numerical precision limits are exceeded by small
trees
• Unknown (but believed not good) scaling
properties
Recursive Neural Network
• We want to
 directly process complex tree structures
 work in a supervised framework
• We use Recursive Neural Networks (RNN)
specialized for tree structures
 Frasconi, Gori, Sperduti. A General Framework for Adaptive Processing of Data Structures. IEEE Transactions on Neural Networks, 1998
What is a RNN?
• Recursive Neural Networks for trees are composed of several replicas of a Recursive Network and one Output Network
[Figure: the Recursive Network maps the label encoding and the states of the 1st … last child into the node state; the Output Network maps the root state to the output]
How does a RNN process
tree data structures?
• General processing step:
 Structure unfolding
• Prediction phase:
 Recursive state update
• Learning phase:
 Backpropagation through structure
Structure unfolding
[Figure sequence: the incremental tree for “It has no bearing on …” (nodes S, NP, VP, PP with POS tags PRP, VBZ, DT, NN, IN over the leaves It, has, no, bearing, on) is unfolded bottom-up, with one replica of the recursive network per node and the output network applied to the root state]
Prediction phase: Information Flow
[Figure sequence: information flows from the leaves up through the unfolded structure to the root, where the output network computes the prediction]
Error Correction: Information Flow
[Figure sequence: the error signal flows from the output network at the root back down through the unfolded structure (back-propagation through structure)]
Encoding POS tags and non terminals
Local or one-hot encoding representation:
 each POS tag or non-terminal symbol (NP, VP, PP, …, VBZ, CC, …) is assigned a position index (01, 02, 03, …)
 the i-th symbol is encoded as a vector with a single 1 in the i-th position (codes like …0001, …0010, …0100)
[Figure: example one-hot code vectors for symbols such as NP and PP]
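A minimal sketch of this local encoding (the symbol inventory below is a small illustrative subset, not the full tag set used in the experiments):

def one_hot(symbol, vocabulary):
    """Local encoding: one position per symbol, only that position set to 1."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(symbol)] = 1
    return vec

labels = ["NP", "VP", "PP", "VBZ", "CC"]   # illustrative subset of the label set
print(one_hot("PP", labels))               # -> [0, 0, 1, 0, 0]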
Updating equations
• Let
x(v) be the state of node v
I(v) be the label (symbolic) of node v
ch[v] be the children of node v
o be the output
• then:
x(v)=f(x(ch[v]), I(v))
o=g(x(r))
• where:
f() is the state updating function
g() is the output function
r is the root node
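A sketch of the recursive state update x(v) = f(x(ch[v]), I(v)) over a tree (the Node class, the weight names, the tanh squashing and the all-zero state used for missing children are assumptions of this sketch):

import numpy as np

class Node:
    def __init__(self, label_encoding, children=()):
        self.label_encoding = label_encoding   # one-hot vector I(v)
        self.children = list(children)

def rnn_state(node, W_label, W_child, b, max_children, state_dim):
    """Compute x(v) bottom-up: children states first, then combine with the label."""
    child_states = [rnn_state(c, W_label, W_child, b, max_children, state_dim)
                    for c in node.children]
    while len(child_states) < max_children:        # pad missing children
        child_states.append(np.zeros(state_dim))
    children_vec = np.concatenate(child_states[:max_children])
    return np.tanh(W_label @ node.label_encoding + W_child @ children_vec + b)

The output o = g(x(r)) is then computed by the output network applied to the state of the root node r.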
Model Overview
• Hybrid system:
 Dynamic Grammar constrains the structure
 Connectionist system provides a predictor
• Linguistic framework
 Strong incrementality assumption
• Statistical framework
 Relevant structural features are adaptively encoded
by the network
 the network computes maximum-likelihood estimates
of correct disambiguation decisions
Incremental processing
• Given some syntactic left context, and a new
word, choose the tree that corresponds to the
new preferred syntactic tree
[Figure: the left context (the NP tree for “The servant of the actress”: D, N, and a PP with P and NP) and the new word “who” (WH) to be attached]
Dynamic grammar of connection paths
(similar to TAG/SuperTAGs)
[Figure: connection paths (tree fragments), one anchored at each word of “It has no bearing”: It (PRP), has (VBZ), no (DT), bearing (NN)]
Connection paths combined to form trees
[Figure: the connection paths are combined to form the incremental tree for “It has no bearing”]
The dynamic grammar generates
many legal next incremental trees
• Variability in attachment point
• Variability in structure of connection path
Attachment ambiguity
[Figure: attachment ambiguity for the left context “The servant of the actress” and the new word “who” (WH): the new constituent can attach at different sites in the left context]
Connection path ambiguity
[Figure: connection path ambiguity for the left context “The athlete realized” and the new word “his”: “his” can start a direct-object NP or the subject NP of an embedded clause (S’)]
Alternatives
• Given a single correct left context and a next
word, the many legal trees that can be formed
according to the dynamic grammar are called
THE FOREST OF ALTERNATIVES
Learning preferences
• Task: learn P(O|I) where
O ∈ {0,1}^m
I = {T1, T2, …, Tm}
• O indicates the only correct element in the
forest of legal alternatives
 ex: o=(0,0,0,1,0,0)
• This is a dynamic classification problem
 m varies from example to example
 typically m=120 (but it can go up to 2000)
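A sketch of this formulation, assuming the recursive network first maps each alternative tree Ti to a scalar score (the function names are illustrative):

import numpy as np

def forest_probabilities(scores):
    """Softmax over the m candidate trees of one forest (m varies per example)."""
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def forest_loss(scores, correct_index):
    """Negative log-likelihood of the single correct alternative."""
    return -np.log(forest_probabilities(scores)[correct_index])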
The learning task: forward phase
[Figure: the forest of alternatives Ti1, Ti2, …, TiN with learning target (0, 1, …, 0); the correct attachment in the training set has target 1, the negative examples have target 0]
The learning task: forward phase
[Figure: the network proposes, via softmax, an output distribution (.02, .95, …, .01) over the alternatives Ti1, Ti2, …, TiN, to be compared against the learning target (0, 1, …, 0)]
The learning task: backward phase
[Figure: the errors E1, E2, E3, … between target and proposed output are propagated back into the corresponding trees Ti1, Ti2, …, TiN by back-propagation through structure]
Running in test mode
[Figure: the softmax outputs (.01, .97, …, .01) over Ti1, Ti2, …, TiN]
Given a left context and all the possible attachments for the current input word, choose the “best” attachment
Experimental setup
• Training on WSJ section of Penn treebank
 realistic corpus representative of natural language
 large size (40,000 sentences, 1 million words)
 uniform language (articles on economic subjects)
Investigations
• Interesting question:
 How are alternatives prioritized?
 What information is the system really using?
Analyse error correlation
with structural features
• We correlate the system error with “interesting” structural features:
 number of nodes
 height
 frequency
• of:
 whole incremental tree
 connection path
• or characteristics such as outdegree or depth of:
 average node
 anchor
 root
Error Correlation
(Spearman Rank Correlation)
Description          Feature          Corr
Average node         Max outdegree    0.18
                     Avg outdegree    0.02
Root                 Avg outdegree    0.21
Anchor               Depth            0.17
Incremental tree     Num nodes        0.20
                     Height           0.19
Connection path      Num nodes        0.33
                     Height           0.34
                     Frequency       -0.39
Forest               Size             0.31
Word                 Index            0.19
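Such rank correlations could be computed, for example, with scipy (the toy values below only illustrate the procedure, not the real data):

from scipy.stats import spearmanr

# per-example error indicator vs. one structural feature (toy values)
errors       = [1, 0, 0, 1, 1, 0, 1, 0]
cpath_height = [5, 2, 3, 6, 4, 2, 5, 3]

rho, p_value = spearmanr(errors, cpath_height)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")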
Refining the analysis
• Analyze and characterize incremental trees
preferred by the network
Discriminating Features
• AIM: find features of incremental trees that
discriminate between correctly classified trees
and mistakenly preferred trees
• We consider:
 true positives: correct incremental trees that are preferred by the network
 false positives: trees preferred by the network that are not the correct ones
What leads to mistakes?
• Discover and characterize feature differences in
incremental trees when comparing the net's false
positive choice against the true element
Description        True Elem   False Pos   Diff/sd
tree height          7.38        7.20        0.05
tree num nodes      30.91       30.55        0.02
cpath height         2.12        1.67        0.64
cpath num nodes      3.35        2.77        0.58
(also considered: tree max outdegree, anchor outdegree, anchor depth)
Result
• The trees that are preferred by the network are simpler than the correct ones:
 less high
 with fewer nodes
 with connection paths that are
• shorter
• with fewer nodes
Further analyses
• Analyze separately the 2 tasks in structural
ambiguity resolution:
 Choosing the correct attachment site
 Choosing the correct connection path
Anchor attachment
• AIM: what is the predictive performance of the net if we consider only the anchor attachment problem?
• EXPERIMENT
 group all the connection paths with the same anchor
point in classes
 rank the groups according to the highest score given
by the net to any member of the group
 a group counts as correctly predicted iff the correct connection path belongs to it
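A sketch of this grouping and ranking step (the data layout and names are assumptions; one forest is given as (anchor, connection path) pairs with the corresponding network scores):

from collections import defaultdict

def anchor_prediction_correct(candidates, scores, correct_index):
    """Group connection paths by anchor, rank groups by their best net score,
    and count the prediction correct iff the top group contains the gold path."""
    best = defaultdict(lambda: float("-inf"))
    for (anchor, _cpath), score in zip(candidates, scores):
        best[anchor] = max(best[anchor], score)
    top_anchor = max(best, key=best.get)
    return candidates[correct_index][0] == top_anchor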
Experiment:
choose correct attachment site
• Network’s performance on attachment-site
prediction has an accuracy of 91.54%
[Chart: percentage correctly classified vs. position (1 to 3) of the correct group in the ranking]
Baseline
• AIM: How good is the previous result? Is the
network more precise than just a simple
frequency heuristic?
• Experiment:
 Rank the connection-path groups according to the group’s most frequent element
 Compare it against the network choice
Is frequency all that is being used?
• Net preference against frequency heuristic:
there is more to it
[Chart: percentage correct vs. position (1 to 3) for the network (Net) and the frequency heuristic (Freq); the network consistently outperforms the frequency baseline]
What kind of statistics is the network using
to predict the attachment site?
• Claim: network is approximating human
heuristics
• Consequence: simulate heuristic and match
with network preferences
• Order alternative attachments by:
 Heuristic 1: Late Closure first, then Minimal
Attachment, then frequency of CP
 Heuristic 2: Minimal Attachment first, then Late
Closure, then frequency of CP
Results
• How often does the choice of the system coincide with that of the heuristics?

Position   Net=LC-MA   Net=MA-LC
1          43.50%      44.50%
1&2        78.30%      61.50%
How is the correct CP chosen?
• Trained network was given test set with correct
attachment sites only, and was required to
choose correct connection path
Results
• Given attachment-site, correct connection-path
is chosen 89% of the time
• Almost all correctly chosen CPs are the most frequent ones (only 3.5% of the correctly chosen CPs are not the most frequent)
Summary of results
• Attachment-site preferences are prioritized over choosing a frequent connection path
• Attachment-site preferences model human heuristics
• Connection paths are then chosen on a frequency basis
Enhancing results
• Structured data can be naturally manipulated and made “simpler”
• The domain can be naturally partitioned into non-overlapping classes
• Idea: we could improve accuracy by means of:
 structural reduction
 modular networks
How much information in the tree does the
network consider when making attachment
decisions?
• Claim:
 The network is not affected by structure deeply
embedded beyond the right frontier
• Consequence:
 that information is not relevant (=noise?) and could
(should!) be deleted from the tree
Reduced incremental trees
[Figure sequence, for the example left context “a friend of Jim saw the thief” and the connection path PP(P with):
 the full example tree (left context plus connection path)
 the right frontier of the left context
 the right frontier plus the c-commanding nodes
 the right frontier plus the c-commanding nodes plus the connection path
 the resulting reduced incremental tree, containing only the nodes S, NP, VP, NP, V, D, N, PP, P]
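A simplified sketch of such a reduction (the Node attributes and the approximation of c-commanding nodes as the immediate sisters of the right-frontier nodes are assumptions of this sketch, not the exact procedure of the original system):

def reduce_tree(root, anchor):
    """Keep the right frontier (the path from the anchor up to the root) and the
    immediate sisters of its nodes; prune everything embedded below those sisters.
    Nodes are assumed to expose .parent and .children attributes."""
    spine = set()
    node = anchor
    while node is not None:            # collect the right frontier
        spine.add(id(node))
        node = node.parent

    def prune(node):
        for child in node.children:
            if id(child) in spine:
                prune(child)           # recurse along the right frontier
            else:
                child.children = []    # keep the sister node, drop its subtree
        return node

    return prune(root)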
Experimental Setup
• Training sets:
 generated from WSJ Penn Treebank (40k sentences)
 “Full” training set with complete trees
 “Reduced” training set with reduced trees
• Testing:
 Compare the performance of the networks trained on the full and the reduced datasets on the disambiguation task (2k sentences)
Results
[Chart: percentage correctly predicted vs. position (1 to 4) for the network trained on Reduced trees and for the Baseline trained on full trees]
Summary of results
• Reduction of trees did not hinder performance
 Accuracy increased with reduced information
 deep information constitutes noise for the system
• The network makes little or no use of
information beyond the right frontier
 Linguistic evidence that deeply embedded
information is less accessible: Right Roof Constraint
on extraposition (Ross, 1967)
Domain partitioning
• Claim: the information needed to learn how to
disambiguate the attachment site depends on
the word that is being processed
• Ex: how to disambiguate the attachment of punctuation can be learned separately from how to attach verbs
• Conclusion: different networks would benefit from specializing in and learning preferences for each type of word
Word classes
• The information on the leaves is represented by Part Of
Speech Tags
• We group them so as to obtain the following classes (see the sketch after this list):
 Nouns
 Verbs
 Prepositions
 Adjectives
 Adverbs
 Articles
 Conjunctions
 Punctuation
 Other
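A sketch of the grouping as a lookup table (the exact tag-to-class mapping below is hypothetical; the real experiments group the full Penn Treebank tag set):

# hypothetical mapping from Penn Treebank POS tags to the coarse word classes
POS_TO_CLASS = {
    "NN": "noun", "NNS": "noun", "NNP": "noun", "NNPS": "noun",
    "VB": "verb", "VBD": "verb", "VBG": "verb", "VBN": "verb", "VBP": "verb", "VBZ": "verb",
    "IN": "preposition", "TO": "preposition",
    "JJ": "adjective", "JJR": "adjective", "JJS": "adjective",
    "RB": "adverb", "RBR": "adverb", "RBS": "adverb",
    "DT": "article",
    "CC": "conjunction",
    ",": "punctuation", ".": "punctuation", ":": "punctuation",
}

def word_class(pos_tag):
    """Route an example to the specialized network for its word class."""
    return POS_TO_CLASS.get(pos_tag, "other")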
Experimental setup
• The WSJ 40K sentences dataset is partitioned
• A modular network is trained on the partitioned
dataset
• The test is run on the 2K partitioned test dataset
Results
[Chart: per-class percentage correct and error reduction for the modular network, by word class (noun, verb, article, adjective, preposition, adverb, conjunction, punctuation, other); y-axes: Percentage Correct and Error reduction, x-axis: Classes]
Summary of results
• The modular network outperforms the single
network:
 Single network: 84.82%
 Modular network: 87.14%
 Total error reduction: 15.3%
• Explanatory hypothesis:
 The partition of the task has led to a better allocation of resources in the state representation space: states coding for less frequent structures of other categories no longer have to compete with more frequent ones
Conclusions
• We can use connectionist systems to model complex preference tasks in a structured domain
• On a structured domain it is easier to investigate the learned properties
• Due to the highly structured nature of the data and of the task we could improve accuracy by means of:
 structural reduction of the data
 natural partitioning of the domain
Future work
• Introduce lexical information
 Note: WSJ has 20,000 lexical items!
• Use the preference model as an informer for an
incremental parser
Thank you