
Learning Syntax with Minimum
Description Length
Mike Dowman
9 March 2006
Syntactic Theory
Chomsky (1957):
• Discovery procedure
• Decision procedure
• Evaluation procedure
Chomsky (1965):
Language acquisition
Poverty of the Stimulus
• Evidence available to children is
utterances produced by other speakers
• No direct cues to sentence structure
• Or to word categories
So children need prior knowledge of
possible structures
→ UG (Universal Grammar)
Negative Evidence
• Some constructions seem impossible to
learn without negative evidence
John gave a painting to the museum
John gave the museum a painting
John donated a painting to the museum
* John donated the museum a painting
Implicit Negative Evidence
If constructions don’t appear, can we just assume
they’re not grammatical?
• No – we only see a tiny proportion of possible,
grammatical sentences
• People generalize from examples they have
seen to form new utterances
‘[U]nder exactly what circumstances does a child
conclude that a nonwitnessed sentence is
ungrammatical?’ (Pinker, 1989)
Minimum Description Length (MDL)
MDL may be able to solve the poverty of the
stimulus problem
Prefers the grammar that results in the
simplest overall description of data
• So prefers simple grammars
• And grammars that result in simple
descriptions of the data
Simplest means specifiable using the least
amount of information
[Diagram: the space of possible sentences, containing the observed sentences, with three inferred grammars drawn around them: a simple but non-constraining grammar, a complex but constraining grammar, and a grammar that is a good fit to the data]
MDL and Bayes’ Rule
P(h | d) ∝ P(h) P(d | h)
• h is a hypothesis (= grammar)
• d is some data (= sentences)
The probability of a grammar given some
data is proportional to its a priori probability
times how likely the observed sentences would
be if that grammar were correct
Probabilities and Complexities
• Information theory relates probability P
and amount of information I
I = -log2 P
It takes less information to encode likely
events compared to unlikely events
P(h | d) ∝ 2^-(I(h) + I(d|h))
The best grammar is the one that allows
both itself and the data to be encoded
using the least amount of information
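To make the trade-off concrete, here is a minimal Python sketch (my own illustration, with made-up numbers, not from the talk) that turns probabilities into coding lengths and compares two hypothetical grammars by their combined grammar-plus-data cost:

import math

def coding_length(probability):
    # Information needed to encode an event: I = -log2(P)
    return -math.log2(probability)

# Hypothetical numbers: grammar A is simple (high prior) but fits the data
# loosely; grammar B is complex (low prior) but fits the data tightly.
grammar_a = {"prior": 2 ** -50, "likelihood_of_data": 2 ** -300}
grammar_b = {"prior": 2 ** -120, "likelihood_of_data": 2 ** -200}

for name, g in [("A", grammar_a), ("B", grammar_b)]:
    total = coding_length(g["prior"]) + coding_length(g["likelihood_of_data"])
    print(f"Grammar {name}: I(h) + I(d|h) = {total:.1f} bits")

# MDL prefers the smaller total, here grammar B (320 bits against 350)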
Complexity and Probability
• More complex grammar
→ lower probability
• More restrictive grammar
→ fewer choices for data, so each possibility
has a higher probability
MDL finds a middle ground between always
generalizing and never generalizing
Encoding Grammars and Data
[Diagram: a bit string such as 1010100111010100101101010001100111100011010110 is fed to a decoder, which recovers first the grammar and then the data coded in terms of that grammar]
Grammar:
A → B C
B → D E
D → {the, a, some}
E → {kangaroo, aeroplane, comedian}
C → {died, laughed, burped}
Data (coded in terms of the grammar):
The comedian died
A kangaroo burped
The aeroplane laughed
Some comedian burped
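As a rough illustration of “data coded in terms of the grammar” (my own sketch, not the talk’s actual coding scheme): once the decoder has the grammar, each sentence can be transmitted as just the indices of the determiner, noun and verb that were chosen.

D = ["the", "a", "some"]
E = ["kangaroo", "aeroplane", "comedian"]
C = ["died", "laughed", "burped"]

def encode(sentence):
    # Represent a sentence as the index of each word within its category
    det, noun, verb = sentence.lower().split()
    return (D.index(det), E.index(noun), C.index(verb))

def decode(code):
    d, e, c = code
    return f"{D[d]} {E[e]} {C[c]}"

print(encode("The comedian died"))  # (0, 2, 0)
print(decode((1, 0, 2)))            # a kangaroo burped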
MDL and Prior (Innate?) Bias
• MDL solves the difficult problem of
deciding prior probability for each
grammar
• But MDL is still subjective – the prior bias
is just hidden in the formalism chosen to
represent grammars, and in the encoding
scheme
MDL in Linguistics
• Morphology (John Goldsmith, 2001;
Michael Brent, 1993)
• Phonology (Mark Ellison, 1992)
• Syntax (Andreas Stolcke, 1994; Langley
and Stromsten, 2000; Grünwald 1994;
Onnis, Roberts and Chater, 2002)
• Iterated Learning (Teal and Taylor, 2000;
Brighton and Kirby, 2001; Roberts, Onnis
and Chater, 2005)
Learning Phrase Structure
Grammars: Dowman (2000)
• Binary or non-branching rules:
S → B C
B → E
C → tomato
• All derivations start from special symbol S
Encoding Grammars
Grammars can be coded as lists of three
symbols
• The first symbol is the rule’s left-hand side, the
second and third its right-hand side
• A null symbol in the third position indicates a
non-branching rule
S, B, C, B, E, null, C, tomato, null
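In Python terms, a grammar in this format could be held as a list of triples and flattened into the symbol list above; a minimal sketch (the names are my own):

NULL = "null"

rules = [
    ("S", "B", "C"),        # S -> B C   (binary rule)
    ("B", "E", NULL),       # B -> E     (non-branching rule)
    ("C", "tomato", NULL),  # C -> tomato
]

# Flatten into the symbol list on the slide
symbols = [symbol for rule in rules for symbol in rule]
print(symbols)  # ['S', 'B', 'C', 'B', 'E', 'null', 'C', 'tomato', 'null']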
Statistical Encoding of Grammars
• First we encode the frequency of each
symbol
• Then encode each symbol using the
frequency information
S, B, C, B, E, null, C, tomato, null
I(S) = -log2 1/9
I(null) = -log2 2/9
→ Uncommon symbols have a higher
coding length than common ones
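A minimal Python sketch of this step (it leaves out the cost of transmitting the frequency table itself, which the full scheme also has to pay for):

import math
from collections import Counter

symbols = ["S", "B", "C", "B", "E", "null", "C", "tomato", "null"]
frequencies = Counter(symbols)
total = len(symbols)

# Each symbol costs -log2 of its relative frequency, so symbols that occur
# twice (B, C, null: 2/9 each) are cheaper than symbols occurring once (1/9).
grammar_bits = sum(-math.log2(frequencies[s] / total) for s in symbols)
print(f"{grammar_bits:.1f} bits for the symbol list")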
Encoding Data
1 S → NP VP
2 NP → john
3 NP → mary
4 VP → screamed
5 VP → died
Data:
John screamed
John died
Mary screamed
Data encoding: 1, 2, 4, 1, 2, 5, 1, 3, 4
There is a restricted range of choices at each
stage of the derivation
→ Fewer choices = higher probability
Statistical Encoding of Data
If we record the frequency of each rule, this information can
help us make a more efficient encoding
1 S → NP VP (3)      Total frequency for S = 3
2 NP → john (2)
3 NP → mary (1)      Total frequency for NP = 3
4 VP → screamed (2)
5 VP → died (1)      Total frequency for VP = 3
Data: 1, 2, 4, 1, 2, 5, 1, 3, 4
Probabilities: 1 → 3/3, 2 → 2/3, 4 → 2/3,
1 → 3/3, 2 → 2/3…
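A small Python sketch (my own illustration of the idea, not the model’s actual code) comparing the two encodings of the derivation sequence 1, 2, 4, 1, 2, 5, 1, 3, 4:

import math

rule_lhs  = {1: "S", 2: "NP", 3: "NP", 4: "VP", 5: "VP"}
rule_freq = {1: 3, 2: 2, 3: 1, 4: 2, 5: 1}          # observed rule frequencies
lhs_rules = {"S": [1], "NP": [2, 3], "VP": [4, 5]}   # rules sharing a left-hand side
data = [1, 2, 4, 1, 2, 5, 1, 3, 4]

# Without frequencies: every rule for the current symbol is equally likely.
uniform_bits = sum(math.log2(len(lhs_rules[rule_lhs[r]])) for r in data)

# With frequencies: each rule is coded by its share of its left-hand side's total.
lhs_total = {lhs: sum(rule_freq[r] for r in rules) for lhs, rules in lhs_rules.items()}
statistical_bits = sum(-math.log2(rule_freq[r] / lhs_total[rule_lhs[r]]) for r in data)

print(f"uniform: {uniform_bits:.2f} bits, statistical: {statistical_bits:.2f} bits")
# uniform: 6.00 bits, statistical: 5.51 bits – the frequency information pays off here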
Encoding in My Model
Number of bits decoded = evaluation
[Diagram: a bit string such as 1010100111010100101101010001100111100011010110 is fed to a decoder, which recovers the symbol frequencies, the grammar, the rule frequencies and the data]
Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)
Grammar:
1 S → NP VP
2 NP → john
3 NP → mary
4 VP → screamed
5 VP → died
Rule frequencies: rule 1 → 3, rule 2 → 2, rule 3 → 1, rule 4 → 2, rule 5 → 1
Data:
John screamed
John died
Mary screamed
Creating Candidate Grammars
• Start with a simple grammar that allows all
sentences
• Make a simple change and see if it improves
the evaluation (add a rule, delete a rule,
change a symbol in a rule, etc.)
• Annealing search
• First stage: just look at data coding length
• Second stage: look at overall evaluation
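A bare-bones Python sketch of such an annealing search (the evaluate and random_change functions stand in for the model’s real coding-length evaluation and grammar-editing operators, and are assumptions on my part):

import math
import random

def anneal(grammar, evaluate, random_change,
           start_temp=10.0, cooling=0.995, steps=10_000):
    current, current_cost = grammar, evaluate(grammar)
    temp = start_temp
    for _ in range(steps):
        candidate = random_change(current)   # add/delete a rule, change a symbol, ...
        cost = evaluate(candidate)
        # Always accept improvements; accept worse grammars with a probability
        # that shrinks as the temperature falls.
        if cost < current_cost or random.random() < math.exp((current_cost - cost) / temp):
            current, current_cost = candidate, cost
        temp *= cooling
    return current

# Stage 1: run with evaluate = data coding length only.
# Stage 2: re-run with evaluate = grammar bits + data bits (the overall evaluation).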
Example: English
John hit Mary
Mary hit Ethel
Ethel ran
John ran
Mary ran
Ethel hit John
Noam hit John
Ethel screamed
Mary kicked Ethel
John hopes Ethel thinks Mary hit Ethel
Ethel thinks John ran
John thinks Ethel ran
Mary ran
Ethel hit Mary
Mary thinks John hit Ethel
John screamed
Noam hopes John screamed
Mary hopes Ethel hit John
Noam kicked Mary
Learned Grammar
S → NP VP
VP → ran
VP → screamed
VP → Vt NP
VP → Vs S
Vt → hit
Vt → kicked
Vs → thinks
Vs → hopes
NP → John
NP → Ethel
NP → Mary
NP → Noam
Evaluations
[Chart: evaluation in bits (0–450) of the initial grammar and the learned grammar, broken down into the overall evaluation and its grammar and data components]
Dative Alternation
John gave a painting to the museum
John gave the museum a painting
John donated a painting to the museum
* John donated the museum a painting
• Two sub-categories of verb
• Productive use of new verbs
• U-shaped learning
Training Data
• Three alternating verbs: gave, passed, lent
• One non-alternating verb: donated
• One verb seen only once: sent
The museum lent Sam a painting
John gave a painting to Sam
Sam donated John to the museum
The museum sent a painting to Sam
Dative Evaluations
[Chart: evaluation in bits (0–3500) of the initial grammar and the learned grammar, broken down into the overall evaluation and its grammar and data components]
Learned Structures
[Parse tree for “John gave a painting to Sam”, using the learned categories S, X, Y, Z, NP, VA, DET, N and P]
But all sentences generated by the grammar are grammatical
Two Verb Classes Learned
• Learned grammar distinguishes alternating
and non-alternating verbs
A grammar with one verb class would be
simpler
So why two classes?
A Grammar with one Verb Class
donated was placed in the same class as
the other verbs, and the redundant rule was deleted

                            donated doesn’t alternate   donated alternates
Overall Evaluation (bits)   1703.4                      1710.4
Grammar (bits)              321.0                       298.2
Data (bits)                 1382.3                      1412.2
The new grammar is simpler
But predicts many ungrammatical sentences
U-shaped Learning
• With less data, a grammar with only one
class of verbs was learned (so donated
could appear in both constructions)
• In this case the benefit derived from a
better description of the data was not
enough to justify a more complex grammar
So a simpler grammar was preferred in
this case
Regular and Irregular Rules
• The model places sent in the alternating class. Why?
Y → VA NP
Y → VA Z
Y → VP Z
VA → passed
VA → gave
VA → lent
VP → donated
VA / VP → sent

                            sent doesn’t alternate   sent alternates
Overall Evaluation (bits)   1703.6                   1703.4
Grammar (bits)              322.2                    321.0
Data (bits)                 1381.4                   1382.3
Regular constructions are preferred
because the grammar is coded statistically
Why use Statistical Grammars?
Statistics are a valuable source of information
→ They help to infer when absences are due to
chance
The learned grammar predicted that sent should
appear in the double object construction
• but in 150 sentences it was only seen in the
prepositional dative construction
• With a non-statistical grammar we need an
explanation as to why this is
• A statistical grammar knows that sent is rare,
which explains the absence of double object
occurrences
Non-statistical Coding of Grammars
and Non-statistical Grammars
(all values in bits; first figure: grammar in which sent does not alternate; second figure: grammar in which sent alternates)

Statistical rules, statistical encoding of the grammar:
  Total 1703.6 / 1703.4   Grammar 322.2 / 321.0   Data 1381.4 / 1382.3
Statistical rules, non-statistical encoding of the grammar:
  Total 1734.2 / 1735.2   Grammar 352.8 / 352.8   Data 1381.4 / 1382.3
Non-statistical rules, statistical encoding of the grammar:
  Total 1653.9 / 1659.0   Grammar 227.2 / 226.0   Data 1426.7 / 1433.0
Non-statistical rules, non-statistical encoding of the grammar:
  Total 1684.6 / 1690.8   Grammar 257.8 / 257.8   Data 1426.7 / 1433.0
The verb sent only alternates when the grammar is both
statistical and encoded statistically
Is this really how alternating verbs
are learned?
• Change in possession verbs (send, give)
alternate
• Unless they are Latinate (donate)
Children are aware of such cues (Gropen
et al., 1989)
So a complete theory of the dative
alternation must take account of semantic
and morphophonological cues
• In MDL, rules can be conditioned on
semantic or phonological information
Does Pinker’s / Mazurkewich and
White’s Solution Still need MDL?
• The correspondence between
morphophonological/semantic cues and
subcategorizations is language-specific
→ It must be learned from distributions
• Which semantic/phonological cues are
relevant?
• Which are due to chance similarity?
→ The same kinds of issues resurface as with a
purely distributional account
Is the Learnability Problem Solved?
• MDL can learn very simple grammars for
small data-sets
• No-one’s succeeded in scaling it up
A second learnability problem arises from
the impossibility of considering all possible
grammars
• Do we need innate constraints if the
search for a correct grammar is to be
successful?
Alternative Grammar Formalisms
Phrase structure grammars are too simple to
capture some phenomena
• Agreement
• Movement
• etcetera
But MDL is compatible with other
grammar formalisms
Neurologically and Psychologically
Plausible?
Take Home Messages
• Example sentences are a rich source of
evidence about linguistic structure
• Statistical information is very valuable – so
don’t ignore it
• Maybe syntax is more learned than innate
• MDL tells us which generalizations are
appropriate (justified by the data) and
which are not
• Lack of negative evidence is not a
particular problem when using MDL
References
Brent, M. (1993). Minimal Generative Explanations: A Middle Ground
between Neurons and Triggers. Proceedings of the 15th Annual Conference
of the Cognitive Science Society. Hillsdale, N.J.: Lawrence Erlbaum
Associates.
Brighton, H. & Kirby, S. (2001). The Survival of the Smallest: Stability
Conditions for the Cultural Evolution of Compositional Language. In J.
Kelemen & P. Sosík (Eds.) Advances in Artificial Life. Berlin: Springer.
Ellison, T. M. (1992). The Machine Learning of Phonological Structure.
Doctor of Philosophy thesis, University of Western Australia.
Chomsky, N. (1957). Syntactic Structures. The Hague: Mouton & Co.
Chomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT
Press.
Goldsmith, J. (2001). Unsupervised Learning of the Morphology of a Natural
Language. Computational Linguistics 27(2): 153-198.
Gropen, J., Pinker, S., Hollander, M., Goldberg, R. & Wilson, R. (1989). The
Learnability and Acquisition of the Dative Alternation in English. Language,
65, 203-257.
Grünwald, P. (1994). A minimum description length approach to grammar
inference. In G. Scheler, S. Wernter, and E. Riloff, (Eds.), Connectionist,
Statistical and Symbolic Approaches to Learning for Natural Language.
Berlin: Springer Verlag.
Langley, P., & Stromsten, S. (2000). Learning context-free grammars with a
simplicity bias. In R. L. de Mantaras, and E. Plaza, (Eds.) Proceedings of
the Eleventh European Conference on Machine Learning. Barcelona:
Springer-Verlag.
Onnis, L., Roberts, M. and Chater, N. (2002). Simplicity: A cure for
overgeneralizations in language acquisition? In W. Gray and C. Schunn
(Eds.) Proceedings of the Twenty-Fourth Annual Conference of the
Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum Associates.
Pinker, S. (1989), Learnability and Cognition: the Acquisition of Argument
Structure. Cambridge, MA: MIT Press.
Roberts, M., Onnis, L., & Chater, N. (2005). Acquisition and evolution of
quasi-regular languages: two puzzles for the price of one. In Tallerman, M.
(Ed.) Language Origins: Perspectives on Evolution. Oxford: Oxford
University Press.
Stolcke, A. (1994). Bayesian Learning of Probabilistic Language Models.
Doctoral dissertation, Department of Electrical Engineering and Computer
Science, University of California at Berkeley.
Teal, T. K. & Taylor, C. E. (2000). Effects of Compression on Language
Evolution. Artificial Life, 6: 129-143.