Unsupervised induction of context-free grammars


Unsupervised Learning of Natural Languages
Eitan Volsky
Yasmine Meroz
Introduction
• Grammar learning methods can be grouped into two kinds: supervised and unsupervised.
• Roughly speaking, unsupervised methods use only plain (or at most pre-tagged) sentences, while supervised methods are first initialized with structured sentences.
• Other forms of supervision exist as well, for example probabilistic grammars.
• Supervised methods clearly outperform unsupervised ones, but they are much more time-consuming, and in many cases no treebank or corpus suitable for the specific task is available.
• Examples :
– Deciphering text in an unknown Language
– DNA sequence analysis.
The bootstrapping process
• The process builds the syntactic structure of a sentence from scratch (when it is completely unsupervised).
• The structure has to be useful; arbitrary, random, or incomplete structures are therefore avoided.
• The system should try to minimize the amount of information it needs in order to learn the structure.
The Scope of the article
• The article presents two unsupervised learning
frameworks :
– EMILE 4.1
– ABL (Alignment-based learning)
• We’ll present the frameworks and the algorithms that underlie them, and compare them on the ATIS and OVIS corpora.
EMILE 4.1
• Some definitions first, using the sentence “David makes tea”:
– “David (.) tea” is a “context”
– “makes” is an “expression”
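To make the definitions concrete, here is a minimal sketch (illustrative, not EMILE 4.1's actual code) that enumerates every (context, expression) pair of a sentence; an expression is any contiguous span of words, and "(.)" marks the hole it leaves in the context, following the notation used in the tables below.

def context_expression_pairs(sentence):
    # An expression is any contiguous span of words; its context is what
    # remains on both sides, with "(.)" marking the hole.
    words = sentence.split()
    pairs = []
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            expression = " ".join(words[i:j])
            context = " ".join(words[:i] + ["(.)"] + words[j:])
            pairs.append((context, expression))
    return pairs

# context_expression_pairs("David makes tea") yields, among others,
# ("David (.) tea", "makes") and ("David (.)", "makes tea").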
Substitution Classes - intuition
• If a language has a CFG, then expressions generated from the same non-terminal can substitute for each other in every context where that non-terminal is a valid constituent.
• If we have a sufficiently rich sample, we can expect to find classes of expressions that cluster together.
Primary and characteristic contexts and
expressions
• A grammatical type is defined as a pair <TC, TE>, where TC is a set of contexts and TE is a set of expressions.
• Expressions and contexts from those sets are called primary.
• A characteristic context for a type T is a context which appears only with expressions of type T; characteristic expressions are defined analogously.
Example
• “walk” can be both a noun and a verb, so it cannot be characteristic of either noun phrases or verb phrases.
• “thing” can only be a noun, thus it appears only in noun phrases.
• “thing” is therefore characteristic for the type “noun”.
Shallow languages
within Chomsky Hierarchy
• Shallow languages seem to form an independent category.
• [Diagram: the Chomsky hierarchy (type 0, context-sensitive, context-free, regular) with the shallow languages drawn as a separate, cross-cutting class.]
Shallow languages - first criterion
• A grammar G is context separable if each type of G has a characteristic context, and expression separable if each type of G has a characteristic expression.
• A shallow language has to be both context and expression separable.
Shallow languages - second criterion
• A shallow language has to have a sample set of sentences S that induces characteristic contexts and expressions for all types of the grammar. Such a set is called a characteristic sample.
• For every sentence s of this sample set: K(s) < log(|G|), where K(s) is the Kolmogorov complexity (descriptive length) of s.
Why Shallow languages ?
• If the sample is drawn under a simple distribution (one dominated by a recursively enumerable distribution), the last criterion guarantees that the sample can be learned (its grammar induced) in time polynomial in |G|.
• Shallow grammars can thus be learned efficiently from positive examples, which makes the poverty-of-stimulus argument based on Gold’s results unconvincing.
Natural Languages are Shallow
• It is claimed (though unproven) that natural languages are shallow.
• Natural languages have large lexicons and relatively few rules.
• Their shallowness ensures that if we sample enough sentences, the sample will be characteristic with high confidence.
How does EMILE really work ?
• Two Phases :
– Clustering
– Rule Induction
Corpus
• John makes tea.
• John likes coffee.
• John is eating.
• John likes tea.
• John likes eating.
• John makes coffee.
• …
Clustering
expr' \ context | John (.) tea | John (.) coffee | John (.) eating | John makes (.) | John likes (.) | John is (.)
makes           |      x       |        x        |                 |                |                |
likes           |      x       |        x        |        x        |                |                |
is              |              |                 |        x        |                |                |
tea             |              |                 |                 |        x       |        x       |
coffee          |              |                 |                 |        x       |        x       |
eating          |              |                 |                 |                |        x       |      x
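A minimal sketch of the first step of the clustering phase, assuming the toy corpus above (the corpus list and the restriction to single-word expressions are simplifications for readability): collect, for every context, the set of expressions observed in it; the rows and columns of the table above can be read off this mapping.

from collections import defaultdict

corpus = [
    "John makes tea", "John likes coffee", "John is eating",
    "John likes tea", "John likes eating", "John makes coffee",
]

# table[context] = set of expressions seen in that context
table = defaultdict(set)
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words)):              # single-word expressions only
        expression = words[i]
        context = " ".join(words[:i] + ["(.)"] + words[i + 1:])
        table[context].add(expression)

# e.g. table["John (.) tea"]   == {"makes", "likes"}
#      table["John likes (.)"] == {"tea", "coffee", "eating"}
for context in sorted(table):
    print(f"{context:18s} {sorted(table[context])}")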
Identification of clusters - settings
• The identification of clusters depends on the following settings:
– Total_support%
– Context_support%
– Expression_support%
• Suppose that:
Total_support% = Context_support% = 75%
Expression_support% = 80%
(a sketch of this support check follows the example table below)
expr' \ context | John makes (.) | John likes (.) | John drinks (.) | John buys (.)
tea             |       x        |       x        |        x        |       x
coffee          |       x        |       x        |        x        |       x
lemonade        |       x        |       x        |        x        |       x
soup            |       x        |       x        |        x        |       x
apples          |                |                |                 |
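A minimal sketch of how the three support settings could be applied to a candidate cluster (this is one plausible reading of the settings, not EMILE 4.1's exact definition; the function and data below are illustrative): a context is kept if it co-occurs with at least Context_support% of the candidate expressions, an expression is kept if it occurs in at least Expression_support% of the candidate contexts, and the remaining block must be filled to at least Total_support%.

def check_cluster(pairs, contexts, expressions,
                  total_support=0.75, context_support=0.75,
                  expression_support=0.80):
    # pairs: set of observed (context, expression) co-occurrences
    kept_contexts = [c for c in contexts
                     if sum((c, e) in pairs for e in expressions)
                     >= context_support * len(expressions)]
    kept_expressions = [e for e in expressions
                        if sum((c, e) in pairs for c in contexts)
                        >= expression_support * len(contexts)]
    filled = sum((c, e) in pairs
                 for c in kept_contexts for e in kept_expressions)
    total_ok = filled >= total_support * len(kept_contexts) * len(kept_expressions)
    return kept_contexts, kept_expressions, total_ok

# With the table above, "apples" is dropped (it occurs in none of the contexts),
# while the four contexts and {tea, coffee, lemonade, soup} are kept.
contexts = ["John makes (.)", "John likes (.)", "John drinks (.)", "John buys (.)"]
expressions = ["tea", "coffee", "lemonade", "soup", "apples"]
pairs = {(c, e) for c in contexts for e in expressions if e != "apples"}
print(check_cluster(pairs, contexts, expressions))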
Rule Induction
• Rules have the form: T => s0 [T1] s1 [T2] s2 [T3] s3
• EMILE attempts to transform the collection of derivation rules found into a CFG consisting of those rules.
• For example:
[0] => what is a family fare
[19] => a family fare
[0] => what is [19]
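A minimal sketch of the substitution step behind the example above (function name and data are illustrative, not EMILE 4.1's actual code): once a type's expressions are known, their occurrences inside existing rules are replaced by a reference to the type, turning flat rules into CFG-style rules.

def induce_rules(sentences, type_id, type_expressions):
    # Emit one rule per expression of the type, then rewrite every sentence
    # as a rule in which those expressions are replaced by [type_id].
    rules = [f"[{type_id}] => {e}" for e in sorted(type_expressions)]
    for s in sentences:
        rewritten = s
        for e in type_expressions:
            rewritten = rewritten.replace(e, f"[{type_id}]")
        rules.append(f"[0] => {rewritten}")
    return rules

# induce_rules(["what is a family fare"], 19, {"a family fare"})
# -> ["[19] => a family fare", "[0] => what is [19]"]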
ABL
(Alignment-Based Learning)
• ABL is based on Harris’ principle of substitutability (1951): all constituents of the same kind can be substituted for each other.
• ABL uses a reversed version of this principle: if parts of sentences can be substituted for each other, they are constituents of the same type.
The Algorithm
• The output of the algorithm is a labeled, bracketed version of the input corpus.
• The model learns by comparing all sentences in the input corpus to each other in pairs.
• Two Phases :
– Alignment learning
– Selection learning
A Comparison of two sentences
• The comparison of two sentences falls into one of three categories:
– All words in the two sentences are the same
– The sentences are completely unequal
– Some words are the same in both sentences and some are not
Alignment Learning
• What is a family fare
• What is the payload of an African swallow ?
• The unequal parts of the sentences are possible constituents:
{a family fare, the payload of an African swallow}
The Edit Distance
• The edit distance is the minimum edit cost needed to transform one sentence into another (Wagner & Fischer, 1974).
• The algorithm that finds the edit distance also finds the longest common subsequence, and it indicates how closely the two sentences are linked.
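A minimal word-level sketch of the Wagner & Fischer dynamic program with unit costs (illustrative; the real ABL implementation may differ): it computes the edit distance between two sentences and backtracks to recover the linked (equal) words; the words left unlinked on each side are the candidate constituents.

def align(s1, s2):
    a, b = s1.split(), s2.split()
    n, m = len(a), len(b)
    # dist[i][j] = cost of transforming a[:i] into b[:j] with unit costs
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # match / substitution
    # Backtrack to recover the matched (equal) word positions.
    matches, i, j = [], n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            matches.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return dist[n][m], list(reversed(matches))

# align("what is a family fare",
#       "what is the payload of an African swallow")
# links the shared prefix "what is"; the unlinked remainders
# ("a family fare" / "the payload of an African swallow") are possible constituents.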
Example
• The same pair of sentences can be aligned in more than one way, yielding different candidate constituents:
From (San Francisco to)1 Dallas ()2
From ()1 Dallas (to San Francisco)2

From (San Francisco)1 to (Dallas)2
From (Dallas)1 to (San Francisco)2
Overlapping Constituents
• I didn’t take my passport.
• I didn’t like this plane.
• If {this plane} was already stored, “like this plane” overlaps with it, and we cannot assign them different types, because that would prevent us from inducing a CFG at a later stage.
Selection Learning
• In the selection learning phase, we try to get rid of the overlapping constituents by finding the best combination of constituents of each type.
• Three ways to compute a constituent’s probability:
– ABL: first-is-correct
– ABL: leaf
– ABL: branch
Selection Learning (cont’)
• After the probabilities of the overlapping constituents have been computed, the probability of each combination is computed as their geometric mean, using a Viterbi-style optimization to do this efficiently.
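A minimal brute-force sketch of the scoring idea (the real ABL uses a Viterbi-style optimization; the spans and probabilities below are made up): every combination of non-crossing candidate constituents is scored by the geometric mean of its constituents' probabilities, and the best-scoring combination is kept.

from itertools import combinations
from math import prod

def crossing(a, b):
    # True if spans a = (start, end) and b overlap without nesting.
    (a1, a2), (b1, b2) = a, b
    return a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2

def best_combination(candidates):
    # candidates: dict mapping a span to its probability (e.g. the "leaf" estimate)
    best, best_score = [], 0.0
    spans = list(candidates)
    for r in range(1, len(spans) + 1):
        for combo in combinations(spans, r):
            if any(crossing(x, y) for x, y in combinations(combo, 2)):
                continue                                  # skip crossing brackets
            probs = [candidates[s] for s in combo]
            score = prod(probs) ** (1.0 / len(probs))     # geometric mean
            if score > best_score:
                best, best_score = list(combo), score
    return best, best_score

# Hypothetical overlapping spans over one sentence, with made-up probabilities:
print(best_combination({(2, 5): 0.4, (3, 6): 0.2, (5, 8): 0.5}))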
Theoretical Comparison
• ABL is much more greedy, and thus learns faster and better on small corpora, but it cannot handle large corpora for efficiency reasons: it stores all possible constituents and only then selects the best ones.
• EMILE was developed for large corpora (more than 100K sentences) and is much less greedy: it introduces a grammar rule only when enough information has been found to support it.
Conclusions
• Both frameworks perform quite well as unsupervised learning models.
• Their underlying ideas match rather well.
• It should be possible to develop a hybrid version which combines the best qualities of both algorithms.