Minimum Description Length - Linguistics and English Language

Minimum Description Length
An Adequate Syntactic Theory?
Mike Dowman
3 June 2005
Linguistic Theory
[Diagram: Primary Linguistic Data → Language Acquisition Device → Individual's Knowledge of Language]
Chomsky’s Conceptualization of Language Acquisition
Diachronic Theories
[Diagram: Primary Linguistic Data → Language Acquisition Device → Individual's Knowledge of Language → Arena of Language Use → Primary Linguistic Data → …]
Hurford’s Diachronic Spiral
Learnability
Poverty of the stimulus
Language is really complex
Obscure and abstract rules constrain wh-movement, pronoun binding, passive formation, etc.
Examples of E-language don’t give
sufficient information to determine this
WH-movement
Who_i do you think Lord Emsworth will invite t_i?
Who_i do you think that Lord Emsworth will invite t_i?
Who_i do you think t_i will arrive first?
* Who_i do you think that t_i will arrive first?
Negative Evidence
• Some constructions seem impossible to
learn without negative evidence
John gave a painting to the museum
John gave the museum a painting
John donated a painting to the museum
* John donated the museum a painting
Implicit Negative Evidence
If constructions don’t appear, can we just assume they’re not grammatical?
⇒ No – we only see a tiny proportion of possible, grammatical sentences
⇒ People generalize from examples they have seen to form new utterances
‘[U]nder exactly what circumstances does a child
conclude that a nonwitnessed sentence is
ungrammatical?’ (Pinker, 1989)
Learnability Proofs
Gold (1967): for languages to be learnable in the limit we must have:
• Negative evidence
• or a priori restrictions on possible
languages
But learnable in the limit means being sure
that we have determined the correct
language
Statistical Learnability
Horning (1969)
• If grammars are statistical
• so utterances are produced with frequencies corresponding to the grammar
⇒ Languages are learnable
• But we can never be sure when the
correct grammar has been found
• This just gets more likely as we see more
data
Horning’s Proof
• Used Bayes’ rule
P(h | d) ∝ P(h) P(d | h)
• More complex grammars are less probable a
priori P(h)
• Statistical grammars can assign probabilities to
data P(d | h)
• Search through all possible grammars, starting
with the simplest
MDL
Horning’s evaluation method for grammars can be seen as a form of Minimum Description Length
⇒ Simplest is best (Occam’s Razor)
⇒ Simplest means specifiable with the least amount of information
Information theory (Shannon, 1948) allows us to link probability and information:
Amount of information (in bits) = -log2(Probability)
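To make the link concrete, here is a minimal Python sketch (not part of the original slides; the probabilities are invented for illustration) that scores two hypothetical grammars by adding the bits needed for the grammar, -log2 P(h), to the bits needed for the data given the grammar, -log2 P(d|h):

import math

def bits(probability):
    # Information content in bits: -log2(probability)
    return -math.log2(probability)

# Invented numbers: the simpler grammar has a higher prior P(h) but fits
# the data less tightly (lower P(d|h)) than the more complex grammar.
grammars = {
    "simple":  {"P(h)": 2 ** -20, "P(d|h)": 2 ** -300},
    "complex": {"P(h)": 2 ** -80, "P(d|h)": 2 ** -250},
}

for name, g in grammars.items():
    total = bits(g["P(h)"]) + bits(g["P(d|h)"])  # grammar bits + data bits
    print(f"{name}: description length = {total:.1f} bits")

# Minimising total bits picks the same grammar that Bayes' rule
# P(h | d) ∝ P(h) P(d | h) ranks highest.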
Encoding Grammars and Data
[Diagram: a decoder turns the bit string 1010100111010100101101010001100111100011010110 into the grammar, followed by the data coded in terms of that grammar]
Grammar:
A → B C
B → D E
D → {the, a, some}
E → {kangaroo, aeroplane, comedian}
C → {died, laughed, burped}
Data coded in terms of the grammar:
The comedian died
A kangaroo burped
The aeroplane laughed
Some comedian burped
Complexity and Probability
• More complex grammar
Longer coding length, so lower probability
• More restrictive grammar
Fewer choices for data, so each possibility has a higher probability
• Most restrictive grammar just lists all
possible utterances
Only the observed data is grammatical, so
it has a high probability
• A simple grammar could be made that
allowed any sentences
Grammar would have a high probability
But data a very low one
MDL finds a middle ground between always
generalizing and never generalizing
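A toy calculation makes this trade-off concrete. All numbers below are assumptions for illustration, not figures from the talk:

import math

# Illustrative assumptions: 50 two-word sentences drawn from 10 of the 12
# combinations of NP in {john, mary, ethel, noam} and VP in {ran, died, screamed}.
N_SENTENCES = 50
BITS_PER_GRAMMAR_SYMBOL = 8  # crude fixed cost per symbol in the grammar

def description_length(grammar_symbols, sentences_allowed):
    grammar_bits = BITS_PER_GRAMMAR_SYMBOL * grammar_symbols
    # Each observed sentence coded uniformly over the sentences the grammar allows
    data_bits = N_SENTENCES * math.log2(sentences_allowed)
    return grammar_bits, data_bits, grammar_bits + data_bits

candidates = {
    # never generalize: one listed rule per observed sentence (2 words each)
    "rote list":     description_length(grammar_symbols=20, sentences_allowed=10),
    # S -> NP VP plus 7 one-word lexical rules: 12 sentences possible
    "NP VP grammar": description_length(grammar_symbols=3 + 7 * 2, sentences_allowed=12),
    # always generalize: any of the 7 words in either slot: 49 sentences possible
    "anything goes": description_length(grammar_symbols=7, sentences_allowed=49),
}

for name, (g, d, total) in candidates.items():
    print(f"{name:14s} grammar={g:6.1f}  data={d:6.1f}  total={total:6.1f} bits")

# The middle grammar wins: it is shorter than the rote list, and its data code
# is much shorter than under the unrestricted grammar.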
Rampant Synonymy?
• Inductive inference (Solomonoff, 1960a)
• Kolmogorov complexity (Kolmogorov, 1965)
• Minimum Message Length (Wallace and
Boulton, 1968)
• Algorithmic Information Theory (Chaitin, 1969)
• Minimum Description Length (Rissanen, 1978)
• Minimum Coding Length (Ellison, 1992)
• Bayesian Learning (Stolcke, 1994)
• Minimum Representation Length (Brent, 1996)
Evaluation and Search
• MDL principle gives us an evaluation
criterion for grammars (with respect to
corpora)
• But it doesn’t solve the problem of how to find the grammars in the first place
⇒ Search mechanism needed
Two Learnability Problems
• How to determine which of two or more
grammars is best given some data
• How to guide the search for grammars so
that we can find the correct one, without
considering every logically possible
grammar
MDL in Linguistics
• Solomonoff (1960b): ‘Mechanization of
Linguistic Learning’
• Learning phrase structure grammars for
simple ‘toy’ languages: Stolcke (1994),
Langley and Stromsten (2000)
• Or real corpora: Chen (1995), Grünwald
(1994)
• Or for language modelling in speech
recognition systems: Starkie (2001)
Not Just Syntax!
• Phonology: Ellison (1992), Rissanen and
Ristad (1994)
• Morphology: Brent (1993), Goldsmith
(2001)
• Segmenting continuous speech: de
Marcken (1996), Brent and Cartwright
(1997)
MDL and Parameter Setting
• Briscoe (1999) and Rissanen and Ristad
(1994) used MDL as part of parameter
setting learning mechanisms
MDL and Iterated Learning
• Briscoe (1999) used MDL as part of an
expression-induction model
• Brighton (2002) investigated effect of
bottlenecks on an MDL learner
• Roberts et al (2005) modelled lexical
exceptions to syntactic rules
An Example: My Model
Learns simple phrase structure grammars
• Binary or non-branching rules:
A → B C
D → E
F → tomato
• All derivations start from the special symbol S
• A null symbol in the third position indicates a non-branching rule
Encoding Grammars
Grammars can be coded as lists of three
symbols
• The first symbol is the rule’s left-hand side, the second and third its right-hand side
A, B, C, D, E, null, F, tomato, null
• First we have to encode the frequency of
each symbol
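A minimal Python sketch of this triple encoding (the coding details are an assumption; the model’s actual scheme may differ):

import math
from collections import Counter

# Rules as (left-hand side, right-hand side 1, right-hand side 2); None marks
# the empty third position of a non-branching rule.
rules = [("A", "B", "C"), ("D", "E", None), ("F", "tomato", None)]

# Flatten the rules into the symbol list, writing 'null' for empty positions
symbols = ["null" if s is None else s for rule in rules for s in rule]
print(symbols)  # ['A', 'B', 'C', 'D', 'E', 'null', 'F', 'tomato', 'null']

# First the frequency of each symbol is encoded, then each occurrence is
# coded with -log2 of its relative frequency
frequencies = Counter(symbols)
total = sum(frequencies.values())
body_bits = sum(-math.log2(frequencies[s] / total) for s in symbols)
print(f"grammar body: {body_bits:.1f} bits (plus the cost of the frequency table)")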
Encoding Data
1 S → NP VP (3)       Total frequency for S = 3
2 NP → john (2)
3 NP → mary (1)       Total frequency for NP = 3
4 VP → screamed (2)
5 VP → died (1)       Total frequency for VP = 3
Data: 1, 2, 4, 1, 2, 5, 1, 3, 4
Probabilities: 1 → 3/3, 2 → 2/3, 4 → 2/3, 1 → 3/3, 2 → 2/3 …
We must record the frequency of each rule
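A short Python sketch of this calculation, reconstructed from the figures on the slide (not the model’s actual code):

import math
from collections import Counter

# Rule number -> (left-hand side, frequency), as on the slide
rules = {1: ("S", 3), 2: ("NP", 2), 3: ("NP", 1), 4: ("VP", 2), 5: ("VP", 1)}
data = [1, 2, 4, 1, 2, 5, 1, 3, 4]  # the derivations, as a sequence of rule numbers

# Total frequency for each left-hand side (S = 3, NP = 3, VP = 3)
lhs_totals = Counter()
for lhs, frequency in rules.values():
    lhs_totals[lhs] += frequency

data_bits = 0.0
for rule_number in data:
    lhs, frequency = rules[rule_number]
    p = frequency / lhs_totals[lhs]  # e.g. rule 1: 3/3, rule 2: 2/3
    data_bits += -math.log2(p)

print(f"data coding length: {data_bits:.2f} bits")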
Encoding in My Model
[Diagram: a decoder turns the bit string 1010100111010100101101010001100111100011010110 into, in order, the symbol frequencies, the grammar, the rule frequencies, and the data]
Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)
Grammar:
1 S → NP VP
2 NP → john
3 NP → mary
4 VP → screamed
5 VP → died
Rule frequencies: Rule 1 → 3, Rule 2 → 2, Rule 3 → 1, Rule 4 → 2, Rule 5 → 1
Data:
John screamed
John died
Mary screamed
Search Strategy
• Start with simple grammar that allows all
sentences
• Make a simple change and see if it improves the evaluation (add a rule, delete a rule, change a symbol in a rule, etc.)
• Annealing search
• First stage: just look at data coding length
• Second stage: look at overall evaluation
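A schematic Python sketch of such a search loop. The evaluate and propose_change functions, the acceptance rule, and the annealing schedule are placeholders; the slide’s two-stage evaluation is not spelled out here:

import random

def search(initial_grammar, data, evaluate, propose_change, steps=10000):
    # evaluate() returns a description length in bits; lower is better.
    current = initial_grammar
    current_score = evaluate(current, data)
    for step in range(steps):
        candidate = propose_change(current)  # add a rule, delete a rule, change a symbol, ...
        score = evaluate(candidate, data)
        # Annealing: early on, occasionally accept a worse grammar to escape
        # local optima; later, accept only improvements.
        temperature = 1.0 - step / steps
        if score < current_score or random.random() < 0.1 * temperature:
            current, current_score = candidate, score
    return current, current_score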
Example: English
John hit Mary
Mary hit Ethel
Ethel ran
John ran
Mary ran
Ethel hit John
Noam hit John
Ethel screamed
Mary kicked Ethel
John hopes Ethel thinks Mary hit Ethel
Ethel thinks John ran
John thinks Ethel ran
Mary ran
Ethel hit Mary
Mary thinks John hit Ethel
John screamed
Noam hopes John screamed
Mary hopes Ethel hit John
Noam kicked Mary
Learned Grammar
S → NP VP
VP → ran
VP → screamed
VP → Vt NP
VP → Vs S
Vt → hit
Vt → kicked
Vs → thinks
Vs → hopes
NP → John
NP → Ethel
NP → Mary
NP → Noam
Evaluations
[Bar chart: evaluation in bits (0–450) of the overall evaluation, the grammar, and the data, for the initial grammar and the learned grammar]
Dative Alternation
• Children learn distinction between
alternating and non-alternating verbs
• Previously unseen verbs are used productively in both constructions
⇒ New verbs follow the regular pattern
• During learning children use non-alternating verbs in both constructions
⇒ U-shaped learning
Training Data
• Three alternating verbs: gave, passed, lent
• One non-alternating verb: donated
• One verb seen only once: sent
The museum lent Sam a painting
John gave a painting to Sam
Sam donated John to the museum
The museum sent a painting to Sam
Dative Evaluations
[Bar chart: evaluation in bits (0–3500) of the overall evaluation, the grammar, and the data, for the initial grammar and the learned grammar]
Grammar Properties
• Learned grammar distinguishes alternating
and non-alternating verbs
• sent appears in alternating class
• With less data, only one class of verbs, so
donated can appear in both constructions
• All sentences generated by the grammar
are grammatical
• But structures are not right
Learned Structures
[Tree diagram: the learned structure for ‘John gave a painting to Sam’, built from the categories S, X, Y, Z, NP, VA, DET, N, and P]
Regular and Irregular Rules
• Why does the model place a newly seen verb in the
regular class?
Y → VA NP
Y → VA Z
Y → VP Z
VA → passed
VA → gave
VA → lent
VP → donated
VA / VP → sent

                             sent doesn’t alternate    sent alternates
Overall evaluation (bits)           1703.6                 1703.4
Grammar (bits)                       322.2                  321.0
Data (bits)                         1381.4                 1382.3
Regular constructions are preferred
because the grammar is coded statistically
Why use Statistical Grammars?
Statistics are a valuable source of information
⇒ They help us to infer when absences are due to chance
The learned grammar predicted that sent should
appear in the double object construction
• but in 150 sentences it was only seen in the
prepositional dative construction
• With a non-statistical grammar we need an explanation as to why this is
• A statistical grammar knows that sent is rare,
which explains the absence of double object
occurrences
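A back-of-the-envelope version of this argument, with assumed rates (sent seen once in the 150 training sentences, and an assumed 50% double-object rate if sent did alternate):

# Illustrative assumptions only, not figures from the model
p_sent = 1 / 150          # 'sent' occurred once in the 150 training sentences
p_double_object = 0.5     # assumed rate of double-object uses if 'sent' alternated
n_sentences = 150

p_no_double_object = (1 - p_sent * p_double_object) ** n_sentences
print(f"P(no double-object 'sent' in {n_sentences} sentences) = {p_no_double_object:.2f}")
# About 0.61: the absence is quite likely by chance alone, so the statistical
# grammar has no reason to treat 'sent' as non-alternating.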
Scaling Up: Onnis, Roberts
and Chater (2003)
Causative alternation:
John cut the string
* The string cut
* John arrived the train
The train arrived
John bounced the ball
The ball bounced
Onnis et al’s Data
• Two word classes: N and V
• NV and VN are the only allowable sentences
16 verbs alternate: NV + VN
10 verbs NV only
10 verbs VN only
Coding scheme marks non-alternating verbs
as exceptional (cost in coding scheme)
Onnis et al’s Results
< 16,000 sentences ⇒ all verbs alternate
> 16,000 sentences ⇒ non-alternating verbs classified as exceptional
No search mechanism – just looked at evaluations with and without exceptions
In the expression-induction model, quasi-regularities appear as a result of chance omissions
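The crossover can be illustrated with a toy calculation (the costs below are invented; Onnis et al’s actual coding figures are not given on the slide): marking verbs as exceptional adds a fixed number of grammar bits but saves a little data per sentence, so it only pays off once the corpus is large enough.

# Invented costs, for illustration only
exception_cost_bits = 400        # extra grammar bits for marking the 20 non-alternating verbs
bits_saved_per_sentence = 0.025  # data bits saved once the impossible frames are ruled out

for n in (10_000, 16_000, 20_000):
    saving = n * bits_saved_per_sentence
    verdict = "exceptions win" if saving > exception_cost_bits else "all verbs alternate"
    print(f"{n:>6} sentences: {saving:6.1f} bits saved vs {exception_cost_bits} spent -> {verdict}")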
MDL and MML Issues
• Numeric parameters – accuracy
• Bayes-optimal classification (not MAP learning) – Monte Carlo methods
⇒ If we see a sentence, work out its probability under each grammar
⇒ The weighted sum gives the probability of the sentence
• Unseen data – zero probability?
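A minimal Python sketch of the weighted-sum idea mentioned above (the grammars and probabilities are invented for illustration):

# Posterior weights for a handful of candidate grammars (invented numbers)
posteriors = {"grammar_A": 0.7, "grammar_B": 0.2, "grammar_C": 0.1}
# Probability each grammar assigns to the sentence in question (invented numbers)
p_sentence_given = {"grammar_A": 1e-6, "grammar_B": 5e-6, "grammar_C": 0.0}

p_sentence = sum(posteriors[g] * p_sentence_given[g] for g in posteriors)
print(f"P(sentence) = {p_sentence:.2e}")

# Unlike MAP learning (keeping only the single best grammar), the weighted sum
# never gives zero probability to a sentence that some plausible grammar allows.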
One- and Two-Part Codes
[Diagram: in a two-part code, a decoder turns the bit string 1010100111010100101101010001100111100011010110 into the grammar and then the data coded in terms of that grammar; in a one-part code, a decoder turns a single bit string into the data and grammar combined, yielding both the grammar and the data]
Coding English Texts
The grammar is a frequency count for each letter and for space
• Counts start at one
• We decode a series of letters – and update the count for each letter
• All letters are coded in terms of their probabilities at that point in the decoding
• At the end we have both a decoded text and a grammar
Decoding Example

Letter   Count   Count   Count   Count
A        1       2       2       2
B        1       1       2       3
C        1       1       1       1
Space    1       1       1       1

Decoded string: A (P=1/4), B (P=1/5), B (P=2/6)
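A small Python sketch reproducing the example above (counts start at one and are updated after each decoded letter):

import math

alphabet = ["A", "B", "C", " "]          # letters plus space
counts = {symbol: 1 for symbol in alphabet}

coding_bits = 0.0
for letter in "ABB":                     # the decoded string from the table
    total = sum(counts.values())
    p = counts[letter] / total           # A: 1/4, then B: 1/5, then B: 2/6
    coding_bits += -math.log2(p)
    print(f"{letter}: P = {counts[letter]}/{total}")
    counts[letter] += 1                  # update the count after coding the letter

print(f"coding length so far: {coding_bits:.2f} bits")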
One-Part Grammars
Grammars can also be coded using one-part
codes
• Start with no grammar, but have a
probability associated with adding a new
rule
• Each time we decode data we first choose whether to add a new rule or to use an existing one
Examples are Dowman (2000) and Venkataraman (1997)
Conclusions
• MDL can solve the poverty of the stimulus problem
• But it doesn’t solve the problem of constraining the search for grammars
• Coding schemes create learning biases
• Statistical grammars and statistical coding of grammars can help learning