
Gibbs Sampling with Treeness Constraint
in Unsupervised Dependency Parsing
David Mareček and Zdeněk Žabokrtský
Institute of Formal and Applied Linguistics
Charles University in Prague
September 15, 2011, Hissar, Bulgaria
Motivations for unsupervised parsing

We want to parse texts for which we do not have any manually annotated treebanks
  - texts from different domains
  - different languages
We want to learn sentence structures from the corpus only
What if the structures produced by linguists are not suitable for NLP?
Annotations are expensive
It is a challenge:
  - can we beat the supervised techniques in some application?
Outline

Parser description
  - Priors
  - Models
  - Sampling
Sampling constraints
  - Treeness
  - Root fertility
  - Noun-root dependency repression
Evaluation
  - on the Czech treebank
  - on all 19 CoNLL treebanks from the 2006 and 2007 shared tasks
Conclusions
Basic features of our approach

Learning is based on Gibbs sampling
We approximate the probability of a tree by a product of the probabilities of its individual edges (sketched below)
We use only POS tags for predicting a dependency relation
  - but we plan to use lexicalization and unsupervised POS tagging in the future
We introduce treeness as a hard constraint in the sampling procedure
The method allows non-projective edges
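
As a sketch of this edge-factorized approximation (the notation below is mine, not taken from the slides), the probability of a dependency tree T is roughly

    P(T) \approx \prod_{(c \to p) \in T} P(c \to p)

where each edge from child c to parent p is scored independently by the models described on the next slide.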
Models

We use two simple models in our experiments (written out below):
  - the parent POS tag conditioned on the child POS tag
  - the edge length (the signed distance between the two words) conditioned on the child POS tag
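
In symbols (again my notation; the exact smoothing and the way the two models are combined are not shown on this slide), each edge from child c to parent p could contribute

    P(c \to p) = P_edge(t_p | t_c) \cdot P_dist(p - c | t_c)

where t_p and t_c are the POS tags of the parent and the child, and p - c is the signed distance between their positions in the sentence.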
Gibbs sampling

We sample each dependency edge independently
50 iterations
The rich get richer (self-reinforcing behavior; a count sketch follows below)
Exchangeability
  - counts are taken from the history
  - we can treat each edge as if it were the last one in the corpus
  - numerators and denominators in the product are exchangeable
The Dirichlet hyperparameters α1 and α2 were set experimentally
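
The following is a minimal sketch of the exchangeable, count-based edge model with a Dirichlet (add-α) prior, as I understand it from this slide; the class name, the add/remove interface, and the exact smoothing formula are my assumptions rather than the authors' code.

    from collections import defaultdict

    class EdgeModel:
        """P(parent_tag | child_tag) from exchangeable counts with a Dirichlet prior."""

        def __init__(self, tagset_size, alpha=1.0):
            self.alpha = alpha                  # Dirichlet hyperparameter (alpha_1 on the slide)
            self.tagset_size = tagset_size
            self.pair = defaultdict(int)        # counts of (child_tag, parent_tag) pairs
            self.child = defaultdict(int)       # counts of child_tag occurrences

        def prob(self, child_tag, parent_tag):
            # Counts come from the "history": all other edges currently in the corpus,
            # so every edge can be treated as if it were the last one sampled.
            return ((self.pair[(child_tag, parent_tag)] + self.alpha) /
                    (self.child[child_tag] + self.alpha * self.tagset_size))

        def add(self, child_tag, parent_tag):       # register a newly sampled edge
            self.pair[(child_tag, parent_tag)] += 1
            self.child[child_tag] += 1

        def remove(self, child_tag, parent_tag):    # withdraw an edge before resampling it
            self.pair[(child_tag, parent_tag)] -= 1
            self.child[child_tag] -= 1

The rich-get-richer behavior comes from prob() growing with the counts of edges already sampled elsewhere in the corpus.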
Basic sampling

For each node, sample its parent with respect to the probability distribution (see the sketch after the example)
The sampling order of the nodes is random
Problem: it may create cycles and discontinuous graphs
[Figure: example dependency graph for the Czech sentence "Její dcera byla včera v zoologické zahradě." ("Her daughter was at the zoo yesterday."), with candidate edge probabilities]
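
A minimal sketch of the basic sampling step described above; the function names are mine, and edge_score(child, parent) is assumed to return the product of the edge and distance model probabilities for that candidate attachment.

    import random

    def sample_parent(node, candidates, edge_score):
        """Sample a new parent for `node` proportionally to the candidate edge scores."""
        others = [p for p in candidates if p != node]
        weights = [edge_score(node, p) for p in others]
        return random.choices(others, weights=weights, k=1)[0]

    def gibbs_iteration(words, parent_of, edge_score):
        """One pass over a sentence; `words` are word positions, 0 is the artificial ROOT."""
        order = list(words)
        random.shuffle(order)              # the sampling order of the nodes is random
        for node in order:
            parent_of[node] = sample_parent(node, [0] + list(words), edge_score)
        # Without further constraints this may create cycles and discontinuous graphs,
        # which the treeness constraint on the next slide repairs.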
Treeness constraint

In case a cycle is created (a repair sketch follows the example below):
  - choose one edge in the cycle (by sampling) and delete it
  - take the subtree that was formed and attach it to one of the remaining nodes (by sampling)
[Figure: the same example sentence with a cycle being broken and the detached subtree reattached, with edge probabilities]
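
A rough sketch of the cycle repair as I read it from this slide; the helper names and the choice of sampling weights are my assumptions (the slide only says "by sampling").

    import random

    def find_cycle(parent_of):
        """Return the nodes of a cycle in the parent map {node: parent}, or None."""
        for start in parent_of:
            path, node = [], start
            while node in parent_of and node not in path:
                path.append(node)
                node = parent_of[node]
            if node in path:                       # we walked back into the path: a cycle
                return path[path.index(node):]
        return None

    def break_cycle(cycle, parent_of, nodes, edge_score):
        """Delete one sampled edge of the cycle and reattach the freed subtree."""
        # 1. choose one edge in the cycle (by sampling) and delete it
        child = random.choices(cycle,
                               weights=[edge_score(c, parent_of[c]) for c in cycle], k=1)[0]
        # 2. attach the formed subtree to one of the remaining nodes (by sampling);
        #    for simplicity this sketch only excludes the cycle nodes themselves,
        #    not the whole subtree hanging below them
        outside = [0] + [n for n in nodes if n not in cycle]   # 0 is the artificial ROOT
        parent_of[child] = random.choices(outside,
                                          weights=[edge_score(child, p) for p in outside], k=1)[0]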
Root fertility constraint

Individual phrases tend to be attached to the technical root
A sentence usually has only one word (the main verb) that dominates the others
We constrain the root fertility to be one
If the root has more than one child, we do a resampling (sketched after the example):
  - sample one child that will stay under the root
  - resample the parents of the other children
[Figure: the example sentence with several words attached to ROOT; all but one of the ROOT's children get new parents by resampling]
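
A minimal sketch of the root-fertility repair under the same assumptions as before (ROOT is node 0, `nodes` are the word positions); how the surviving child is chosen is my guess at what "by sampling" means here.

    import random

    def enforce_root_fertility(parent_of, nodes, edge_score):
        """Keep exactly one child under the artificial ROOT (node 0)."""
        root_children = [n for n in nodes if parent_of[n] == 0]
        if len(root_children) <= 1:
            return
        # sample one child that will stay under the root
        keep = random.choices(root_children,
                              weights=[edge_score(c, 0) for c in root_children], k=1)[0]
        # resample the parents of the other children; ROOT is excluded this time
        for child in root_children:
            if child == keep:
                continue
            candidates = [p for p in nodes if p != child]
            weights = [edge_score(child, p) for p in candidates]
            parent_of[child] = random.choices(candidates, weights=weights, k=1)[0]
        # a full implementation would re-check the treeness constraint afterwards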
Noun-ROOT dependency repression

Nouns (especially subjects) often substitute for verbs in the governing positions
The majority of grammars are verbocentric
Nouns can be easily recognized as the most frequent coarse-grained tag category in the corpus
We add a model that represses noun-ROOT dependencies (an illustrative sketch follows)
This model is useless when unsupervised POS tagging is used
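
Since the formula itself is not reproduced in this transcript, the following is only an illustration of what a repression factor could look like: a constant penalty multiplied into the edge score whenever a noun attaches to the technical root. The constant, the tag test, and the placement of the factor are my assumptions, not the authors' model.

    NOUN_ROOT_PENALTY = 0.01   # hypothetical constant, not taken from the paper

    def repressed_edge_score(child, parent, child_tag, edge_score):
        """Multiply the usual edge score by a penalty for noun -> ROOT attachments."""
        score = edge_score(child, parent)
        if parent == 0 and child_tag.startswith("N"):   # noun attaching to the technical root
            score *= NOUN_ROOT_PENALTY                  # (tag test assumes a CoNLL-style tagset)
        return score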



Evaluation measures

Evaluation of an unsupervised parser against GOLD data is problematic
  - many linguistic decisions must have been made before annotating each corpus
  - how to deal with coordination structures, auxiliary verbs, prepositions, subordinating conjunctions?
We use the three following measures (a scoring sketch follows):
  - UAS (unlabeled attachment score) – the standard metric for evaluating dependency parsers
  - UUAS (undirected unlabeled attachment score) – edge direction is disregarded (it is not a mistake if governor and dependent are switched)
  - NED (neutral edge direction, Schwartz et al., 2011) – treats not only a node's gold parent and gold child as correct answers, but also its gold grandparent
UAS ≤ UUAS ≤ NED
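
The sketch below shows how the three scores relate, using only the descriptions above; the full NED definition in Schwartz et al. (2011) is more detailed, so the NED part is an approximation.

    def attachment_scores(gold_parent, pred_parent):
        """Compute UAS, UUAS, and an approximate NED from two head maps {node: parent}."""
        uas = uuas = ned = 0
        n = len(gold_parent)
        for node, pred in pred_parent.items():
            gold = gold_parent[node]
            correct_directed = (pred == gold)
            # UUAS: also correct if governor and dependent are switched,
            # i.e. the predicted parent is the node's gold child
            correct_undirected = correct_directed or (gold_parent.get(pred) == node)
            # NED: additionally accept the node's gold grandparent
            correct_ned = correct_undirected or (pred == gold_parent.get(gold))
            uas += correct_directed
            uuas += correct_undirected
            ned += correct_ned
        return uas / n, uuas / n, ned / n

By construction, UAS ≤ UUAS ≤ NED on any parse.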
Evaluation on Czech



Czech dependency treebank from CoNLL 2007 shared task
Punctuation removed
only sentences of at most 15 words
Configuration                               UAS   UUAS  NED
Random baseline                             12.0  19.9  27.5
LeftChain baseline                          30.2  53.6  67.2
RightChain baseline                         25.5  52.0  60.6
Base                                        36.7  50.1  55.1
Base+Treeness                               36.2  46.6  50.0
Base+Treeness+RootFert                      41.2  58.6  70.8
Base+Treeness+RootFert+NounRootRepression   49.8  62.6  73.0
Error analysis for Czech

Many errors are caused by reversed dependencies:
  - preposition – noun
  - subordinating conjunction – verb
Evaluation on 19 CoNLL languages

We have taken the dependency treebanks from the CoNLL 2006 and 2007 shared tasks
POS tags from the fifth column were used
The parsing was run on the concatenated training and development sets
Punctuation was removed
Evaluation was done on the development sets only
We compare our results with the state-of-the-art system, which is based on DMV (Spitkovsky et al., 2011)
Evaluation on 19 CoNLL languages
Conclusions

We introduced a new approach to unsupervised dependency parsing
Even though only a couple of experiments have been done so far, and only POS tags with no lexicalization are used, the results seem to be competitive with the state-of-the-art unsupervised parsers (DMV)
We have a better UAS for 12 out of 19 languages
If we do not use the noun-root dependency repression, which is useful only with supervised POS tags, we have better scores for 7 out of 19 languages
Future work

We would like to add:
  - Word fertility model
    - to model the number of children of each node
  - Lexicalization
    - the word forms themselves should be useful
  - Unsupervised POS tagging
    - some recent experiments show that using word classes instead of supervised POS tags can improve parsing accuracy
Thank you for your attention.