
Modeling Parameter Setting Performance in Domains
with a Large Number of Parameters:
A Hybrid Approach
CUNY / SUNY / NYU Linguistics Mini-conference
March 10, 2001
William Gregory Sakas
&
Dina Demner-Fushman
1
Primary point (to make [quickly, painlessly] on a
Saturday just before lunch):
Not enough to build a series of computer simulations of
a cognitive model of human language acquisition and claim
that it mirrors the process by which a child acquires
language.
The (perhaps obvious) fact is that learners are acutely
sensitive to cross-language ambiguity.
Whether or not a learning model is ultimately successful
as a cognitive model is an empirical issue; it depends on
the ‘fit’ of the simulations with the facts about the
distribution of ambiguity in human languages.
2
What’s coming:
1) Some early learnability results
2) A feasibility case study analysis of one
parameter-setting model: the Triggering
Learning Algorithm (TLA), Gibson and Wexler
(1994)
3) Conjectures and a proposed research agenda
3
Why computationally model natural language
acquisition?
Pinker (1979) :
"...it may be necessary to find out how language learning
could work in order for the developmental data to tell
us how is does work." [emphasis mine]
4
Learnability: Is the learner guaranteed to converge on
the target grammar for every language in a given
domain?
Gold (1967), Wexler and Culicover (1980), Gibson & Wexler (1994), Kanazawa (1994)
An early learnability result (Gold, 1967)
Exposed to input strings of an arbitrary target
language generated by grammar Gtarg, no learner can be
guaranteed to converge on Gtarg if Gtarg may be drawn
from any of the classes of the Chomsky hierarchy
(e.g., the context-free grammars).
5
Gold’s result is sometimes taken to be strong
evidence for a nativist Universal Grammar.
1) Psycholinguistic research indicates that children learn
grammar based on positive exemplar sentences.
2) Gold proves that Greg, Gcfg, Gcs and Gre can’t be learned this way.
Conclude:
some grammatical competence must be in place before
learning commences.
Gold’s result is often misapplied, but it has generated much discussion.
6
Another Learnability result:
All classes of grammars possible within the principles
and parameters framework are learnable because they
are finite.
In fact a simple Blind Guess Learner is guaranteed to
succeed in the long run for any finite class of grammars.
Blind Guess Learner:
1. randomly hypothesize a current grammar
2. consume and attempt to parse a sentence from the
linguistic environment
3. If the sentence is parsable by the current grammar, go to 2.
Otherwise go to 1.
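A minimal Python sketch of the Blind Guess Learner (hypothetical names; it assumes a finite list of candidate grammars, a stream of target-language sentences, and an abstract parses(grammar, sentence) test, none of which are specified on the slide):

    import random

    def blind_guess_learner(grammars, input_stream, parses):
        # 1. randomly hypothesize a current grammar
        current = random.choice(grammars)
        for s in input_stream:            # 2. consume a sentence
            if not parses(current, s):    # 3. if unparsable, blindly guess again
                current = random.choice(grammars)
        return current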
7
Feasibility: Is acquisition possible within a reasonable
amount of time and/or with a reasonable amount
of work?
Clark (1994, in press), Niyogi and Berwick (1996), Lightfoot (1989) (degree-0), Sakas (2000),
Tesar and Smolensky (1996) and many PAC results concerning induction of FSAs
Feasibility measure (Sakas and Fodor, in press)
A near-linear increase in the expected number of
sentences consumed before a learner converges on
the target grammar.
8
Feasibility result: The Blind Guess Learner succeeds only
after consuming a number of sentences exponential
in the number of parameters.
If # parameters = 30
then # grammars = 2^30
= 1,073,741,824
The search space is huge!
9
A Feasibility Case Study :
A three parameter domain (Gibson and Wexler, 1994)
SV / VS - subject precedes verb / verb precedes subject
VO / OV - verb precedes object / object precedes verb
+V2 / -V2 - the verb or aux must / need not be in the second position in the sentence
Sentences are strings of the symbols: S, V, O1, O2, AUX, ADV
Allie will eat the birds → S AUX V O
10
Two example languages (finite, degree-0)
SV VO -V2
(English-like)
S V
S V O
S V O1 O2
S AUX V
S AUX V O
S AUX V O1 O2
ADV S V
ADV S V O
ADV S V O1 O2
ADV S AUX V
ADV S AUX V O
ADV S AUX V O1 O2
SV OV +V2
(German-like)
ADV V S
ADV V S O
ADV V S O2 O1
ADV AUX S V
ADV AUX S O V
ADV AUX S O2 O1 V
S V
S V O
O V S
S V O2 O1
O1 V S O2
O2 V S O1
S AUX V
S AUX O V
O AUX S V
S AUX O2 O1 V
O1 AUX S O2 V
O2 AUX S O1 V
11
Surprisingly, G&W’s simple 3-parameter domain
presents nontrivial obstacles to several types of
learning strategies, but the space is ultimately
learnable.
Big question:
How will the learning process scale up in terms of
feasibility as the number of parameters increases?
Two problems for most acquisition strategies:
1) Ambiguity
2) Size of the domain
12
Cross-language ambiguity
[Figure: the two example languages from the previous slide (English-like SV VO -V2 and German-like SV OV +V2), with a few ambiguous strings indicated; for example, S V, S V O and S AUX V occur in both languages.]
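Reading the ambiguous strings off the two degree-0 lists is mechanical; a quick Python check using the sentence lists from the previous slide:

    english_like = {
        "S V", "S V O", "S V O1 O2",
        "S AUX V", "S AUX V O", "S AUX V O1 O2",
        "ADV S V", "ADV S V O", "ADV S V O1 O2",
        "ADV S AUX V", "ADV S AUX V O", "ADV S AUX V O1 O2",
    }
    german_like = {
        "ADV V S", "ADV V S O", "ADV V S O2 O1",
        "ADV AUX S V", "ADV AUX S O V", "ADV AUX S O2 O1 V",
        "S V", "S V O", "O V S", "S V O2 O1", "O1 V S O2", "O2 V S O1",
        "S AUX V", "S AUX O V", "O AUX S V",
        "S AUX O2 O1 V", "O1 AUX S O2 V", "O2 AUX S O1 V",
    }
    print(sorted(english_like & german_like))   # ['S AUX V', 'S V', 'S V O']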
13
P&P acquisition:
How can we obtain informative feasibility results
for linguistically interesting domains with
cognitively plausible learning algorithms?
14
One approach: create an input space for a linguistically
plausible domain and run simulations
(Briscoe (2000), Clark (1992), Elman (1990, 1991, 1996), Yang (2000)).
But this won't work for large domains.
So, how to answer questions of feasibility
as the number of grammars (exponentially)
scales up?
Answer:
introduce some formal notions in order to abstract
away from the specific linguistic content of the
input data.
15
A hybrid approach (formal/empirical)
1) formalize the learning process and
input space
2) use the formalization in a Markov
structure to empirically test the learner
across a wide range of learning
scenarios
The framework gives general data on
the expected performance of
acquisition algorithms. Can answer
the question:
Given learner L, if the input space
exhibits characteristics x, y and z, is
feasible learning possible?
16
Syntax acquisition can be viewed as a state space search
— nodes represent grammars including a start state and a
target state.
— arcs represent a possible change from one hypothesized
grammar to another.
[Figure: eight nodes labeled 000 through 111, one of them marked Gtarg, connected by arcs.]
A possible state space for parameter space with 3 parameters.
17
The Triggering Learning Algorithm (TLA), Gibson and Wexler (1994)
Searches the (huge) grammar space using local heuristics
repeat until convergence:
receive a string s from L(Gtarg)
if s can be parsed by Gcurr, do nothing [error-driven]
otherwise, pick a grammar that differs by
one parameter value from the current grammar [Single Value Constraint (SVC)]
if this grammar parses the sentence,
make it the current grammar, otherwise do nothing [greediness]
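A minimal Python sketch of the TLA update, under the same assumptions as the Blind Guess sketch above (grammars as tuples of binary parameter values, a hypothetical parses(grammar, sentence) predicate):

    import random

    def tla_step(g_curr, s, parses):
        # error-driven: if the current grammar parses s, do nothing
        if parses(g_curr, s):
            return g_curr
        # Single Value Constraint: try a grammar differing in exactly one parameter
        i = random.randrange(len(g_curr))
        g_attempt = tuple(v if j != i else 1 - v for j, v in enumerate(g_curr))
        # greediness: adopt the attempted grammar only if it parses s
        return g_attempt if parses(g_attempt, s) else g_curr

    def tla(g_init, input_stream, parses):
        g = g_init
        for s in input_stream:
            g = tla_step(g, s, parses)
        return g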
18
Local state space for the TLA
Example: if Gcurr = 010, then Gattempt = a random G ∈ {000, 110, 011} [SVC]
Stay at 010 if s ∈ L(Gcurr) [error-driven]
or if s ∉ L(Gattempt) [greediness]
Move 010 → 000 if s ∉ L(Gcurr), Gattempt = 000 and s ∈ L(G000)
Move 010 → 110 if s ∉ L(Gcurr), Gattempt = 110 and s ∈ L(G110)
Move 010 → 011 if s ∉ L(Gcurr), Gattempt = 011 and s ∈ L(G011)
19
Probabilistic Formulation of TLA performance
αi denotes the ambiguity factor
αi = Pr(s ∈ L(Gtarg) ∩ L(Gi))
βi,j denotes the overlap factor
βi,j = Pr(s ∈ L(Gi) ∩ L(Gj))
λi denotes the probability of picking or
"looking ahead" at a new hypothesis grammar Gi
20
The probability that the learner moves from state
Gcurr to state Gnew
= (1 − αcurr) (λnew) Pr(Gnew can parse s | Gcurr can't parse s)
[error-driven] [SVC] [greediness]
After some algebra:
Pr(Gcurr → Gnew) = (λnew) (αnew − βcurr,new)
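The intermediate step (with s drawn from L(Gtarg) throughout, so that Pr(s ∈ L(Gnew)) = αnew):

Pr(Gnew can parse s | Gcurr can't parse s)
= Pr(s ∈ L(Gnew) and s ∉ L(Gcurr)) / Pr(s ∉ L(Gcurr))
= (αnew − βcurr,new) / (1 − αcurr)

Multiplying by (1 − αcurr)(λnew) cancels the (1 − αcurr) terms, leaving (λnew)(αnew − βcurr,new).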
21
Parameter space H4 with Gtarg = 1111. Each ring, or G-Ring, contains exactly those
grammars at a certain Hamming distance from the target. For example, ring G2 contains 0011,
0101, 1100, 1010, 1001 and 0110, all of which differ from the target grammar 1111 by 2 bits.
[Figure: the 16 grammars of H4 arranged in concentric G-Rings around the target grammar Gtarg = 1111; the outermost ring, G-Ring G4, contains 0000.]
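G-Rings are easy to enumerate; a small Python sketch grouping all grammars of an n-parameter space by Hamming distance from the target:

    from itertools import product

    def g_rings(target):
        # rings[d] = all grammars at Hamming distance d from the target
        n = len(target)
        rings = {d: [] for d in range(n + 1)}
        for g in product((0, 1), repeat=n):
            d = sum(a != b for a, b in zip(g, target))
            rings[d].append(g)
        return rings

    rings = g_rings((1, 1, 1, 1))
    print(len(rings[2]))   # 6 grammars differ from 1111 in exactly 2 bits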
22
Smoothness - there exists a correlation between
the similarity of grammars and the similarity of the
languages that they generate.
Weak Smoothness Requirement - all the members
of a G-Ring can parse s with equal probability.
Strong Smoothness Requirement - The parameter
space is weakly smooth and the probability that s
can be parsed by a member of a G-Ring increases
monotonically as distance from the target
decreases.
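Read operationally, strong smoothness is a monotonicity condition on the per-ring parse probabilities (ring 0 being the target itself). A small sketch under that reading, treating "increases monotonically" as non-decreasing toward the target:

    def is_strongly_smooth(ring_parse_probs):
        # ring_parse_probs[d] = Pr(s is parsable by a member of the ring at distance d)
        return all(nearer >= farther
                   for nearer, farther in zip(ring_parse_probs, ring_parse_probs[1:]))

    print(is_strongly_smooth([1.0, 0.7, 0.5, 0.3, 0.1]))   # True
    print(is_strongly_smooth([1.0, 0.4, 0.6, 0.3, 0.1]))   # False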
23
Experimental Setup
1) Adapt the formulas for the transition
probabilities to work with G-rings
2) Build a generic transition matrix into which
varying values of α and β can be employed
3) Use a standard Markov technique to calculate
the expected number of inputs consumed by
the system
(construct the fundamental matrix)
Goal - find the ‘sweet spot’ for TLA
performance
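Step 3 is standard absorbing Markov chain machinery. A NumPy sketch: with Q holding the transition probabilities among the transient (non-target) states, the fundamental matrix N = (I − Q)^(−1) gives expected visit counts, and its row sums give the expected number of sentences consumed from each start state:

    import numpy as np

    def expected_inputs(Q):
        # Q[i, j] = Pr(move from transient state i to transient state j)
        n = Q.shape[0]
        N = np.linalg.inv(np.eye(n) - Q)   # fundamental matrix
        return N @ np.ones(n)              # expected sentences to reach the target

    # Toy example with two transient states; the target is reachable from both.
    Q = np.array([[0.5, 0.2],
                  [0.1, 0.6]])
    print(expected_inputs(Q))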
24
Three experiments
1) G-Rings equally likely to parse an input sentence
(uniform domain)
2) G-Rings are strongly smooth (smooth domain)
3) Anything goes domain
Problem:
How to find combinations of α and β that are optimal?
Solution:
Use an optimization algorithm: GRG2 (Lasdon and
Waren, 1978).
25
Result 1: The TLA performs worse than blind
guessing in a uniform domain - exponential
increase in # sentences
[Chart, logarithmic scale: # of sentences consumed vs. # of parameters (0 to 35), with curves for the Blind Guess Learner and the TLA.]
Results obtained employing optimal values of α and β
26
Result 2: The TLA performs extremely well in
a smooth domain - but still nonlinear
increase
[Chart, linear scale: # of sentences vs. # of parameters (0 to 40).]
Results obtained employing optimal values of α and β
27
Result 3: The TLA performs a bit better in the
Anything Goes scenario - optimizer
chooses ‘accelerating’ strong smoothness
[Chart, linear scale: # of sentences vs. # of parameters (0 to 35), with curves labeled Accelerating and Linear.]
Results obtained employing optimal values of α and β
28
In summary
TLA is an infeasible learner:
With cross-language ambiguity uniformly
distributed across the domain of languages,
the number of sentences consumed by the TLA is
exponential in the number of parameters.
TLA is a feasible learner:
In strongly smooth domains, the number of
sentences increases at a rate much closer to
linear as the number of parameters increases
(i.e., even as the number of grammars increases
exponentially).
29
No Best Strategy Conjecture (roughly in the
spirit of Schaffer, 1994):
Algorithms may be extremely efficient in
specific domains, but not in others; there is
generally no best learning strategy.
Recommendation:
We have to know the specific facts about the
distribution, or shape, of ambiguity in natural
language.
30
Research agenda:
Three-fold approach to building a cognitive
computational model of human language acquisition:
1) formulate a framework to determine what
distributions of ambiguity make for feasible learning
2) conduct a psycholinguistic study to determine if
the facts of human (child-directed) language are in
line with the conducive distributions
3) conduct a computer simulation to check for
performance nuances and potential obstacles (e.g.
local maxima based on defaults, or Subset Principle
violations)
31
Tag Slides (if time):
Why is Gold's theorem so often misapplied?
32
Gold’s result is sometimes taken to be strong
evidence for a nativist Universal Grammar.
1) Psycholinguistic research indicates that children learn
grammar based on positive exemplar sentences.
2) Gold says that Greg, Gcfg, Gcs and Gre can’t be learned this way.
Conclude:
some grammatical competence must be in place before
learning commences.
Gold’s result is often misapplied, but it has generated much discussion.
33
Gold misapplied.
Tacit assumptions:
human language is a computable process
By the Church-Turing thesis, no computational model is more
powerful than Lre.
Hence, Lhuman is a subset of Lre
The Chomsky hierarchy is an appropriate framework
in which to examine Lhuman
Many, many interesting formal results about CFGs,
automata, etc. But where does Lhuman lie?
34
Given the language-as-computation assumption and Gold’s
result, it may be that the class of human languages
cuts across the classes of the Chomsky hierarchy.
[Figure: Lhuman cutting across the nested classes Lreg ⊂ Lcfg ⊂ Lcs ⊂ Lre.]
Angluin’s Theorem (1980) provides necessary and
sufficient conditions for such a class.
35