
Using Minimum Description Length to Make Grammatical Generalizations
Mike Dowman
University of Tokyo
What should Syntactic Theory Explain?
• Which sentences are grammatical and
which are not
or
• How to transform observed sentences into
a grammar
[Diagram: E → (learning) → I]
Children transform observed sentences (E) into psychological knowledge of language (I)
How should we study syntax?
Linguists’ Approach:
• Choose some sentences
• Decide on grammaticality of each one
• Make a grammar that accounts for which of these sentences are grammatical and which are not
[Diagram: informant → sentences → linguist → grammar]
Computational Linguists’ Approach
(Unsupervised Learning)
• Take a corpus
• Extract as much information from the corpus as possible
or
• Learn a grammar that describes the corpus as accurately as possible
[Diagram: corpus → (learning) → grammar, lexical items, language model, etc.]
Which approach gives more insight into language?
Linguists tend to aim for high precision
• But only produce very limited and arbitrary
coverage
Computational linguists tend to obtain much better
coverage
• But don’t account for any body of data
completely
• And tend to learn only simpler kinds of structure
The approaches seem to be largely
complementary
Which approach gives more insight into the human mind?
The huge size and complexity of languages
is one of their key distinctive properties
→ The linguists’ approach doesn’t account for this
So should we apply our algorithms to large
corpora of naturally occurring data?
→ This won’t directly address the kind of issue that syntacticians focus on
Negative Evidence
• Some constructions seem impossible to
learn without negative evidence
John hurt himself
Mary hurt John
John hated himself
Mary hated John
John behaved himself
* Mary behaved John
(Positive evidence alone gives the learner no reason to treat behave differently from hurt or hate.)
Implicit Negative Evidence
If we never hear something, can’t we just assume it’s not grammatical?
Sentences we never heard?
Phrases we never heard?
Verb argument constructions we never heard?
Word-affix combinations we never heard?
How often does something have to not occur
before we decide it’s not grammatical?
At what structural level do we make
generalizations?
Minimum Description Length (MDL)
MDL may be able to solve the ‘no negative
evidence’ problem
Prefers the grammar that results in the
simplest overall description of data
• So prefers simple grammars
• And grammars that allow simple
descriptions of the data
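In symbols, MDL chooses the grammar G that minimizes L(G) + L(D | G): the bits needed to state the grammar plus the bits needed to describe the data given the grammar. A minimal Python sketch of this trade-off (the function name is mine; the figures are the ones reported later in the talk for the verb-class experiment):

```python
def mdl_score(grammar_bits: float, data_bits: float) -> float:
    """Total description length: bits for the grammar plus bits for
    the data given the grammar, L(G) + L(D | G)."""
    return grammar_bits + data_bits

# A trivial grammar is cheap to state but describes the data poorly;
# a memorising grammar fits the data tightly but is huge to state.
candidates = {
    "one verb class":  mdl_score(29_915.1, 220_520.4),   # 250,435.5
    "class per verb":  mdl_score(111_036.5, 187_026.7),  # 298,063.2
    "learned classes": mdl_score(37_885.5, 207_312.4),   # 245,197.9
}
print(min(candidates, key=candidates.get))  # -> "learned classes"
```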
[Diagram, built up over four slides: the space of possible sentences, with the observed sentences as a small region inside it. A simple but non-constraining grammar covers nearly the whole space; a complex but constraining grammar hugs the observed sentences exactly; a grammar that is a good fit to the data covers the observed sentences plus a modest surrounding region.]
Why it has to be MDL
Many machine learning techniques have
been applied in computational linguistics
MDL is very rarely used, and has not been especially successful at learning grammatical structure from corpora
So why MDL?
Maximum Likelihood
Maximum likelihood can be seen as a special case
of MDL in which the a priori probability of all
hypotheses P(h) is equal
But the hypothesis that only the observed sentences are grammatical always yields the maximum likelihood
So ML can only be applied if there are restrictions
on how well the estimated parameters can fit the
data
The degree of generality of the grammars is set
externally, not determined by the Maximum
Likelihood principle
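A toy illustration of why unrestricted ML never generalizes (the numbers are invented for the example): a grammar that licenses only the observed sentences concentrates all probability mass on them, so any grammar that also licenses unseen sentences scores strictly worse on the observed data.

```python
from math import prod

observed = ["John screamed", "John died", "Mary screamed"]

def likelihood(sentence_probs, data):
    """Probability of the data under a given sentence distribution."""
    return prod(sentence_probs[s] for s in data)

# Grammar that licenses only the observed sentences:
memorise = {s: 1/3 for s in observed}
# Grammar that also licenses unseen sentences (say, "Mary died"),
# so the observed ones get less of the probability mass:
general = {s: 1/6 for s in observed}

print(likelihood(memorise, observed))  # 1/27  ~ 0.037
print(likelihood(general, observed))   # 1/216 ~ 0.0046 -- always lower
```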
Maximum Entropy
Make the grammar as unrestrictive as
possible
But constraints must be used to prevent a
grammar just allowing any combination of
words to be a grammatical sentence
Again the degree of generality of
grammars is determined externally
Neither Maximum Likelihood nor Maximum Entropy provides a principle that can decide when to make generalizations
Learning Phrase Structure Grammars
1 S → NP VP
2 NP → John
3 NP → Mary
4 VP → screamed
5 VP → died
Data:
John screamed
John died
Mary screamed
Describing the data in terms of the grammar:
1, 2, 4 = John screamed
There is a restricted range of choices at each stage of the derivation
→ Fewer choices = higher probability
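A sketch of that derivation cost, assuming uniform choices among the rules for each non-terminal (the actual model weights choices by rule frequency):

```python
from math import log2

# Number of alternative expansions for each non-terminal in the toy grammar:
expansions = {"S": 1, "NP": 2, "VP": 2}

# Derivation 1, 2, 4 = "John screamed": one choice at each step.
bits = sum(log2(expansions[nt]) for nt in ["S", "NP", "VP"])
print(bits)  # 2.0 -- fewer alternatives at each step means fewer bits
```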
Encoding in My Model
Number of bits decoded = evaluation
[Diagram: a decoder reads the bit string 1010100111010100101101010001100111100011010110 and reconstructs the grammar, the symbol and rule frequencies, and the data.]

Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)

Grammar:
1 S → NP VP
2 NP → john
3 NP → mary
4 VP → screamed
5 VP → died

Rule frequencies: rule 1: 3, rule 2: 2, rule 3: 1, rule 4: 2, rule 5: 1

Data:
John screamed
John died
Mary screamed
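A minimal sketch of how such an evaluation could be computed, assuming an idealized frequency-based code (the model's exact coding scheme may differ): each occurrence of a symbol or rule costs -log2(frequency / total) bits, and the evaluation sums these costs.

```python
from math import log2

symbol_freqs = {"S": 1, "NP": 3, "VP": 3, "john": 1, "mary": 1,
                "screamed": 1, "died": 1, "null": 4}
rule_freqs = {1: 3, 2: 2, 3: 1, 4: 2, 5: 1}

def code_length(freqs):
    """Bits to encode every occurrence under a frequency-based code."""
    total = sum(freqs.values())
    return sum(n * -log2(n / total) for n in freqs.values())

evaluation = code_length(symbol_freqs) + code_length(rule_freqs)
print(f"{evaluation:.1f} bits")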
Example: English
John hit Mary
Mary hit Ethel
Ethel ran
John ran
Mary ran
Ethel hit John
Noam hit John
Ethel screamed
Mary kicked Ethel
John hopes Ethel thinks Mary hit Ethel
Ethel thinks John ran
John thinks Ethel ran
Mary ran
Ethel hit Mary
Mary thinks John hit Ethel
John screamed
Noam hopes John screamed
Mary hopes Ethel hit John
Noam kicked Mary
Learned Grammar
S → NP VP
VP → ran
VP → screamed
VP → Vt NP
VP → Vs S
Vt → hit
Vt → kicked
Vs → thinks
Vs → hopes
NP → John
NP → Ethel
NP → Mary
NP → Noam
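A quick way to check the learned grammar (this parsing step is my illustration, using the NLTK library, not part of the talk): it licenses the recursive embedded sentences in the training data.

```python
from nltk import CFG
from nltk.parse import ChartParser

# The learned grammar, written in NLTK's CFG notation.
grammar = CFG.fromstring("""
S -> NP VP
VP -> 'ran' | 'screamed' | Vt NP | Vs S
Vt -> 'hit' | 'kicked'
Vs -> 'thinks' | 'hopes'
NP -> 'John' | 'Ethel' | 'Mary' | 'Noam'
""")

parser = ChartParser(grammar)
for tree in parser.parse("John hopes Ethel thinks Mary hit Ethel".split()):
    tree.pretty_print()
```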
Real Language Data
Can the MDL metric also learn grammars
from corpora of unrestricted natural
language?
If it could, we’d largely have finished syntax
But the search space is far too big
We need to simplify the task in some way
→ Only learn verb subcategorization classes
Switchboard Corpus
( (S
    (CC and)
    (PRN
      (, ,)
      (S
        (NP-SBJ (PRP you))
        (VP (VBP know)))
      (, ,))
    (NP-SBJ-1 (PRP she))
    (VP (VBD spent)
      (NP
        (NP (CD nine) (NNS months))
        (PP (IN out)
          (PP (IN of)
            (NP (DT the) (NN year)))))
      (S-ADV
        (NP-SBJ (-NONE- *-1))
        (ADVP (RB just))
        (VP (VBG visiting)
          (NP (PRP$ her) (NNS children)))))
    (. .) (-DFL- E_S)))

Extracted information:
Verb: spent
Subcategorization frame: * NP S
Extracted Data
Only verbs tagged as VBD (past tense) extracted
Modifiers to basic labels ignored
21,759 training instances
704 different verbs
706 distinct subcategorization frames
25 different types of constituent appeared
alongside the verbs (e.g. S, SBAR, NP, ADVP)
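A hedged sketch of that extraction (the helper names are mine, and the handling of traces such as the * in the frame above is omitted): find VPs headed by a VBD verb and record the sibling constituent labels with their functional modifiers stripped.

```python
from nltk import Tree

def extract_frames(parse):
    """Yield (verb, frame) for VPs headed by a past-tense (VBD) verb."""
    for vp in parse.subtrees(lambda t: t.label() == "VP"):
        kids = [k for k in vp if isinstance(k, Tree)]
        if kids and kids[0].label() == "VBD":
            verb = kids[0][0]
            # Strip functional modifiers such as -SBJ or -ADV from labels.
            frame = [k.label().split("-")[0] for k in kids[1:]]
            yield verb, frame

tree = Tree.fromstring(
    "(S (NP-SBJ (PRP she)) (VP (VBD spent)"
    " (NP (CD nine) (NNS months)) (S-ADV (VP (VBG visiting)))))")
print(list(extract_frames(tree)))  # [('spent', ['NP', 'S'])]
```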
Verb Class Grammars
S → Class1 Subcat1
S → Class1 Subcat2
S → Class2 Subcat1
Class1 → grew
Class1 → ended
Class2 → did
grew and ended can appear with subcats 1 and 2; did only with subcat 2
Grouping together verbs with similar subcategorizations should improve the evaluation
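Rough arithmetic suggesting why (the frames-per-verb figure is invented for illustration; the verb and class counts are from this experiment): listing every grammatical verb+frame pair directly costs one rule per pair, whereas classes cost one rule per class+frame pair plus one membership rule per verb.

```python
def rules_without_classes(n_verbs, frames_per_verb):
    # One rule per grammatical verb+frame pair.
    return n_verbs * frames_per_verb

def rules_with_classes(n_classes, frames_per_class, n_verbs):
    # One rule per class+frame pair, plus one membership rule per verb.
    return n_classes * frames_per_class + n_verbs

print(rules_without_classes(704, 5))   # 3520 rules
print(rules_with_classes(6, 5, 704))   # 734 rules
```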
A New Search Mechanism
We need a search mechanism that will only
produce candidate grammars of the right form
• Start with all verbs in one class
• Move a randomly chosen verb to a new class (P = 0.5) or to a different existing class (P = 0.5)
• Empty verb classes are deleted
• Redundant rules are removed
A New Search Mechanism (2)
Annealing search:
• After no changes are accepted for 2,000
iterations switch to merging phase
• Merge two randomly selected classes
• After no changes accepted for 2,000 iterations
switch back to moving phase
• Stop after no changes accepted for 20,000
iterations
• Multiple runs were conducted and the grammar
with the overall lowest evaluation selected
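A condensed sketch of this search loop (mdl_score is a placeholder for the full MDL evaluation of a partition; redundant-rule removal and the multiple-runs outer loop are omitted):

```python
import random

def search(verbs, mdl_score, patience=2_000, stop=20_000):
    """Move/merge annealing search over partitions of verbs into classes.

    `verbs` is a list; `mdl_score` evaluates a candidate partition.
    """
    classes = [set(verbs)]                 # start with all verbs in one class
    score = mdl_score(classes)
    rejections, moving = 0, True
    while rejections < stop:
        candidate = [set(c) for c in classes]
        if moving:                         # moving phase
            verb = random.choice(verbs)
            src = next(c for c in candidate if verb in c)
            src.discard(verb)
            others = [c for c in candidate if c is not src and c]
            if random.random() < 0.5 or not others:
                candidate.append({verb})   # move verb to a brand-new class
            else:                          # ...or to another existing class
                random.choice(others).add(verb)
        elif len(candidate) > 1:           # merging phase
            a, b = random.sample(range(len(candidate)), 2)
            candidate[a] |= candidate[b]
            del candidate[b]
        candidate = [c for c in candidate if c]   # delete emptied classes
        new = mdl_score(candidate)
        if new < score:                    # keep only improvements
            classes, score, rejections = candidate, new, 0
        else:
            rejections += 1
            if rejections % patience == 0: # 2,000 rejections: switch phase
                moving = not moving
    return classes, score
```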
Grammar Evaluations
Evaluation   One verb class   Each verb in a separate class   Best learned grammar
Overall      250,435.5        298,063.3                       245,198.0
Grammar      29,915.1         111,036.5                       37,885.5
Data         220,520.4        187,026.7                       207,312.4
(Overall = Grammar + Data)
Learned Classes
Class 1
Verbs: thought, vowed, prayed, decided, adjusted, wondered, wished, allowed, knew, suggested, claimed, believed, remarked, resented, detailed, misunderstood, assumed, competed, snowballed, smoked, said, struggled, determined, noted, understood, foresaw, expected, discovered, realized, negotiated, suspected, indicated
Description: usually take an S or SBAR complement (SBAR usually contains that or who etc. followed by an S)

Class 2
Verbs: enjoyed, canceled, liked, had, finished, traded, sold, ruined, needed, watched, loved, included, received, converted, rented, bred, deterred, increased, encouraged, made, swapped, shot, offered, spent, impressed, discussed, missed, carried, injured, presented, surprised…
Description: usually take an NP argument (often in conjunction with other arguments)

Class 3
Verbs: did
Description: did only

Class 4
Verbs: all other verbs
Description: miscellaneous

Class 5
Verbs: used, named, tried, considered, tended, refused, wanted, managed, let, forced, began, appeared
Description: typically take an S argument (but never just an SBAR)

Class 6
Verbs: wound, grew, ended, closed, backed
Description: usually take a particle
Did MDL make appropriate generalizations?
The learned verb classes are clearly linguistically
coherent
But they don’t account for exactly which verbs can
appear with which subcats
Linguists have proposed far more fine-grained
classes
Data available for learning was limited (subcats had
no internal structure, Penn Treebank labels may not
be sufficient)
→ But linguists can’t explain which verbs appear with which subcats either
Conclusions
• MDL (and only MDL) can determine when
to make linguistic generalizations and
when not to
• The same MDL metric can be used both
on small sets of example sentences and
on unrestricted corpora
• Work using corpora does not address the
kind of issues that syntacticians are
interested in