Class 5 - CUNY


Introduction to
Language Acquisition Theory
Janet Dean Fodor
St. Petersburg July 2013
Class 5. The Subset Principle:
Essential but impossible?
1
Today: How far to generalize?
 If there is no reliable source of negative data for correcting
overgeneralization errors, they must be prevented in
advance.
 UG cannot do this alone. Not all languages have the same
breadth of generalizations (e.g. Verb-Second / aux-inversion).
 The learner must choose how far to generalize, from
among the UG-compatible (parametric) alternatives.
 Additional sources of innate guidance = strategies of
the learning mechanism:
Uniqueness Principle (Class 3)
Subset Principle: “Guess conservatively” (Class 1)
 CRUCIAL QUESTION: Can the Subset Principle (SP) help
learners to find the right balance between conservative
(safe) learning and rapid progress toward the target
grammar?
2
Generalize how far? which way?
 Children don’t have unlimited memory resources.
This makes wise grammar choice more difficult.
 A standard assumption: No storage of raw input.
 Child cannot contemplate a huge stored collection of
examples and look for patterns (as linguists do).
 Incremental learning = retain or change grammar
hypothesis after each input sentence. Memory-free!
 So, on hearing a novel sentence, a child must adopt it,
along with some other sentences, into her language.
(Children don’t acquire only sentences they’ve heard.)
But which other sentences?
 What size and shape is the generalization a learner
formulates, based on that individual input sentence?
3
LM strategies must also be innate
 Even with UG, a learner encountering a novel sentence
has to choose from among many possible generalizations.
 Even in the P&P framework, the number of possible
grammars is in the millions or billions (2^30 ≈ a billion).
 The choices that LM makes are evidently not random.
All children exposed to a target language arrive at more or
less the same grammar (as far as we can tell).
 Conclusion: Guessing strategies must also be genetically
given. (Perhaps these correspond to linguistic markedness.)
 Traditional term for this: an evaluation metric.
An important component of the learning mechanism
(Chomsky, 1965). It prioritizes grammar hypotheses.
 RESEARCH GOAL: Specify the evaluation metric. Is it
specific to language, or domain-general (e.g., simplicity)?
4
Evaluation metric:
Err in the direction of under-generalization
 If LM undergeneralizes based on one input, the grammar
can be broadened later as more examples are heard.
 And even if not, the consequences are not disastrous:
as an adult, the learner might lack some means of
expression in the language. (Would anyone notice?)
E.g., someone might not acquire the English subjunctive.
 By contrast, if LM generalizes too broadly from an input
sentence, with insufficient negative data to correct it later, it
becomes a permanent error. Very noticeable if it occurs!
 As an adult, s/he would utter incorrect sentences, e.g.
*Went she home? I met the boy *(who) loves Mary.
In L1 acquisition, this happens rarely, if ever.
5
Overgeneralization rarely occurs
 Children rarely overgeneralize syntactic patterns (despite
informal impressions; Snyder 2011).
 There are reports in the literature (e.g. Bowerman:
*Brush me my hair), but remarkably few (considering how
noticeable/cute these errors are). And most are lexical:
a wrong subcategorization for the verb brush.
 More research needed: Search the CHILDES database to
establish the frequency and type of overgeneralization
errors evidenced in children’s spontaneous speech.
What proportion of them are pure syntax?
E.g., How good a dancer is he?
* How good dancers are they?
 In morphology, there are many overgeneralizations: *foots,
*runned. But these can be driven out later by correct forms
(Uniqueness Principle, Class 3).
6
SP’s job is to fend off overgeneralization
 SP informal statement: “…the learner must guess the
smallest possible language compatible with the input at
each step”. (Clark 1992)
 JDF: Good thought, but imprecise. If two input-compatible
languages intersect (neither containing the other), LM may freely guess either.

 But it’s in the right spirit: A ‘small’ guess is always
safer. It guarantees that if the guess was wrong, there
will be a positive datum (trigger) to put it right later.
 However: Unexpectedly, we will see that hugging the
input tightly with small guesses can impede learning!
 A paradox that we must solve: SP is essential for
learning without negative evidence. But it impedes
acquisition of valid linguistic generalizations.
7
SP: Adopt a ‘smallest superset’ of the input
 On hearing s, the learner may hypothesize either L1 or
L2, but not L3.
[Diagram: sentence s lies in the overlap of L1 and L2; L3 is a larger language that also contains s. Assume these are the only languages permitted by UG which contain s.]
A ‘smallest superset’ of some set S of sentences is a
language which has no proper subset that contains S.
Both L1 and L2 are smallest supersets of sentence s.
8
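
As an illustration (not part of the original slides), here is a toy Python sketch of the ‘smallest superset’ definition above, treating each UG-permitted language as a finite set of sentences. The sentences "a", "b", "c" are invented to recreate a configuration like the diagram’s, in which L3 is not a smallest superset of s.

def smallest_supersets(S, candidates):
    # Candidate languages that contain S and have no proper subset
    # (among the candidates) that also contains S.
    containing = [L for L in candidates if S <= L]
    return [L for L in containing if not any(M < L for M in containing)]

s = {"s"}
L1 = {"s", "a"}
L2 = {"s", "b"}
L3 = {"s", "a", "c"}          # in this toy setup, a proper superset of L1

print(smallest_supersets(s, [L1, L2, L3]))
# -> L1 and L2 qualify; L3 does not, because L1 is a proper subset of it that contains s.
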
SP: Adopt a ‘smallest superset’ of the input
 On hearing s’, the learner must hypothesize L2, not L3.
[Diagram: s lies in L1 and L2, s’ lies in L2, and s” lies only in L3. Assume these are the only languages permitted by UG which contain s, s’ and/or s”.]
 Hypothesize L3 only on hearing s”.
L3 is now the smallest superset of the current input, s”.
9
SP works badly in the older TG framework
 In terms of transformational rules: SP would favor the
maximum context restrictions on any transformation:
 CORRECT: Invert aux in the context of a fronted negative.
Not once did I think of my own safety.
 WRONG : Invert aux in the context of a fronted element.
*At once did I think of my own safety.
 SP favors maximum feature specifications for what fronts.
 CORRECT: Invert [+AUX, +V, -N].
 WRONG: Invert [+V, -N].
*Not once thought I of my own safety.
 In other words: In a rule-based framework, SP requires
the learner to prefer the most complex rules! Simplicity
yields generality. Not a plausible evaluation metric.
10
Subset Principle in a parametric model
 Good: P&P theory doesn’t suffer from that dangerous
relation between simplicity (good) and over-generality
(incurable). Because all P-values are equally simple.
 What is required: If parameter triggering is to be automatic
& effortless (as the theory claims), LM must be able to tell
without effort when a subset-superset choice presents itself,
and which of the competing grammars yields the subset.
 Is it sufficient for each parameter to have a default
value? Simple Defaults Model. (Manzini & Wexler 1987)
 That would make it easy to obey SP! If both values of a
parameter P are compatible with the input, adopt the default.
 But no. Natural language examples suggest the parameters
must also be priority-ordered with respect to each other.
(See below: the Ordered Defaults Model)
11
First, let’s check: Do s-s relations actually
occur in the natural language domain?
 Yes. Any parameter whose values are ‘optional’ vs. ‘obligatory’.
 Also, every historical process of either addition or loss
(but not both at once!) creates an s-s relation.
 Imagine: Italian with split-DP (as in Latin): Adj_i Verb Noun_i.
 Actual example: expansion of the ’s genitive since late
Middle English, without loss of the of-genitive.
 But not just s-s relations between whole languages:
superset errors must be avoided parameter by parameter.
 E.g., if the target is English, don’t adopt long-distance
anaphora (LDA) – even if that language as a whole isn’t a
superset of English (e.g., the learner has no passives yet).
 Why? Because there’s no link between LDA and Passive
that could trigger a retreat on LDA.
12
Binding theory parameters (Manzini & Wexler, 1987)
 Binding principles: Anaphors must be locally bound.
Pronouns must be non-locally bound. What counts as local?
 5 degrees of locality. An anaphor must be bound in the
minimal domain that includes it & a governor for it, & has:
a. a subject
b. an Infl
c. a tense
d. a ‘referential’ tense
e. a ‘root’ tense
(a) is the most local setting (fewer anaphors, more pronouns); (e) is the least local (more anaphors, fewer pronouns).
 Creates nested subset languages (5-valued parameter!).
Proof that s-s relations exist in natural languages.
 Other M&W assumptions: The Independence Principle (“the subset
relations that are determined by the values of a parameter hold no
matter what the values of the other parameters are.”)
 M&W assumed the Simple Defaults Model
13
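
As an illustration (not Manzini & Wexler’s own formalism), here is a minimal Python sketch of how SP could operate on a multi-valued parameter whose values yield nested languages, as in the locality hierarchy above: keep the least (most local) value that licenses the current input, moving up only when forced. The helper licenses is a hypothetical stand-in for whatever decides whether a given value licenses a sentence.

LOCALITY_VALUES = ["a", "b", "c", "d", "e"]   # a = most local, smallest language

def sp_update(current_value, sentence, licenses):
    # licenses(value, sentence) -> bool is assumed monotone: if a value
    # licenses a sentence, every less-local (larger-language) value does too.
    start = LOCALITY_VALUES.index(current_value)
    for value in LOCALITY_VALUES[start:]:
        if licenses(value, sentence):
            return value          # least compatible value = SP-safe guess
    raise ValueError("no locality value licenses this sentence")

Because the values are nested, never jumping past the least compatible value guarantees the learner never overshoots on this parameter.
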
Do children actually apply SP?
 There are cases in which it is claimed that children do not
obey SP: they overgeneralize and then retreat.
 Chien & Wexler (1990) found that children apply Principle A
reliably, at least by age 6; but they make Principle B errors.
Mama bear washed her. Compatible with a picture of self-washing.
 Is Principle B not innate? Does it mature late? If it’s innate
and accessible early, this is a clear violation of SP.
 HOWEVER: C&W also found that the children did not make
mistakes when the antecedent was quantified: Every bear washes her.
 Explanation: Principle B blocks co-indexation of a pronoun
with a local antecedent. But only a pragmatic principle
prevents contra-indexed expressions from having the same
referent. It is this pragmatic principle that the children don’t yet
know. They do well with the quantified examples because
that’s not a matter of coreference at all.
14
Do children apply SP? Another example.
 Déprez & Pierce (1993) claim that children sometimes
overgenerate (= SP violation), and later retreat.
 At approx. 1½ to 3 years, learners of English have optional
raising of the subject from its VP-internal base position:
No my play my puppet. – Neg Subj Verb… (subject is low)
He no bite you. – Subj Neg Verb… (subject has raised)
 “…the errors described here as the result of failure to raise
the subject out of the verb phrase are attributed to a valid
grammatical option to assign nominative Case to that position”
 How do these learners retreat later? Not clearly addressed.
 “earlier stages may be stages in which the grammar has
yet to ‘stabilize’ on unique settings of various parameters”
 So: Is this a real SP-violation? Or just vacillation between two
values (maybe due to confusion in analyzing input - “no”)?
15
SP: summary so far
 Subset-superset relations do hold among natural
languages.
 Not all are attributable to a subset/superset choice
within a single parameter. (See examples, next slide.)
 A Subset Principle that must also exclude superset errors
arising across parameters (not just within a single one) is
more complex for a learner to impose.
 Nevertheless, child learners in fact commit few or no
superset errors. Some apparent violations of SP are
explicable in other terms (e.g., pragmatic immaturity;
input confusion).
 So our next question: How do they do it?
16
How can LM know what’s a superset to avoid?
 Simple Defaults Model: Easy to read off s-s relations.
Prefer 0 over 1. (0 = default; 1 = marked value)
 E.g. Grammar 01101 licenses a language that has proper
subsets licensed by grammars 00101, 01001, 01100,
and their proper subsets: 00001, 00100, 01000 and 00000.
 So avoid 01101 if any of these others fits the input.
(A toy sketch of this subset computation follows this slide.)
 But in the natural language domain, there are s-s
relations that exceed the bounds of a single parameter.
 Ex-1: WH-fronting (subset) versus Scrambling that includes
scrambling of wh-items (superset; WH may be initial or not).
A choice for learner to make: English versus Japanese.
 Ex-2: Optional topic (subset) versus obligatory topic that is
optionally null (superset; obligatory XPs can be missing).
A choice for learner to make: English versus German.
17
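
As flagged above, here is a toy Python sketch (an illustration, not part of the original slides) of reading off subset grammars from bit strings in the Simple Defaults Model, on the slide’s assumption that resetting any marked value to its default yields a grammar licensing a proper subset language.

from itertools import combinations

def subset_grammars(g):
    # All grammars obtained from g by resetting one or more marked values ('1')
    # back to the default ('0'); under the Simple Defaults Model each of these
    # licenses a proper subset of the language licensed by g.
    marked = [i for i, bit in enumerate(g) if bit == "1"]
    result = []
    for k in range(1, len(marked) + 1):
        for positions in combinations(marked, k):
            bits = list(g)
            for i in positions:
                bits[i] = "0"
            result.append("".join(bits))
    return result

print(subset_grammars("01101"))
# ['00101', '01001', '01100', '00001', '00100', '01000', '00000']

SP then says: do not adopt 01101 if the current input is licensed by any grammar on this list.
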
Ordered Defaults Model (Fodor & Sakas)
 Child hears: Which apples shall we buy?
Decision: Which parameter to set to its marked value?
 SP requires the prioritization (0 = default; 1 = marked value):
[Wh-movt 0, Scrambling 0] > [Wh-movt 1, Scrambling 0] > [Wh-movt 0, Scrambling 1]
 How could this crucial prioritization be economically mentally
represented by learners?
 One way: Innately order the parameters such that parameters
earlier in the ordering get set to their marked values before
parameters later in the ordering. Cf. Gold’s ‘enumeration’!
 Priority: 10000 > 01000 > 00100 etc.
And for two marked values: 11000 > 10100 > 10010 etc.
18
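
A small Python sketch of one way such a priority ordering could be generated (a construal of the slide, not an attested implementation): grammars with fewer marked values come first, and ties are broken in favor of grammars whose marked parameters come earlier in the innate ordering. The learner would adopt the first grammar in this enumeration that licenses the current input.

from itertools import product

def priority_key(g):
    # Rank by (number of marked values, positions of the marked values):
    # fewer marked values first; ties broken by earlier-ordered parameters.
    marked = tuple(i for i, bit in enumerate(g) if bit == "1")
    return (len(marked), marked)

grammars = ["".join(bits) for bits in product("01", repeat=5)]
ranked = sorted(grammars, key=priority_key)
print(ranked[:7])
# ['00000', '10000', '01000', '00100', '00010', '00001', '11000']

This reproduces the slide’s priorities (10000 > 01000 > 00100; 11000 > 10100 > 10010), and the enumeration flavor is exactly the parallel to Gold noted above.
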
But: Where does the parameter ordering come from?
 Learning strategies cannot be learned! Must be innate.
 So: Learners couldn’t learn this parameter ordering that
enforces SP. So what is its source?
 Could it be innate, supplied by UG? But HOW?
(a) Could evolution have crafted a precise mental map of the
s-s relationships between all possible natural languages?
(b) Or could the LM calculate these complex subset-superset
relations on-line? Surely a serious computational overload!
 Neither is plausible! Perhaps instead, the ‘black hole’
explanation. There is an innate mental ordering but it’s
completely arbitrary, not shaped by evolution. The learnable
languages (which are the only ones linguists know about)
are the ones that just happen to precede their supersets.
19
SP is necessary, but can be harmful too
 SP can guide learners to make safe guesses, where a
choice must be made in generalizing from a specific input.
It can prevent learners from overshooting the target language.
 But: it interacts badly with incremental learning.
 For an incremental learner, SP may cause learning
failures due to undershooting the target language.
 When setting a parameter to accommodate a new
sentence, SP requires LM to give up much of what it
had acquired from previous input. (Example below.)
We call this retrenchment.
 Why retrench? Because the combination of a previously
hypothesized language plus a new hypothesis might be
a superset of the target, violating SP. (See diagram.)
 Must clean up, when abandoning a falsified grammar.
20
SP demands maximum retrenchment!
[Diagram: languages Lsuperset, Lother and Lcurrent, with s inside Lcurrent and the new input t outside Lcurrent.]
Sentences in the shaded area of Lcurrent must be given up when
LM encounters input t which falsifies Lcurrent.
 Learner first heard s, and happened to guess Lcurrent.
 Learner now hears t, so realizes that Lcurrent is wrong.
 Learner doesn’t know which parts of Lcurrent are wrong.
 So to be safe, all of it must be discarded, except what
input sentence t entails in conjunction with UG.
 All the striped sentences must be ditched.
21
Clark’s example (Class 2):
retrenchment needed
 Without retrenchment, one mistake can lead to another.
 Clark (1989) showed a sequence of steps that can end
in a superset error, although no one step violates SP.
(1) John believes them(ACC) to have left.
 Is this exceptional case marking (ECM)?
Or structural case assignment (SCA)?
(2) They believe themselves to be smart.
 Is the anaphor in the binding domain of they? (yes, if ECM)
Or is this long-distance anaphora (LDA)? (yes, if SCA)
22
Without retrenchment → errors
 Suppose the target is English (ECM; no LDA).
 Dangerous parameter-setting sequence:
 Guess SCA for sentence (1). Wrong, but no problem.
 Then guess LDA for sentence (2). Still OK so far!
 Later, learner hears an unambiguous trigger for ECM:
(3) Sue is believed by John to have left.
 On adopting ECM, learner must give up SCA.
 And must also give up LDA (which was based on SCA).
Otherwise, would mis-generate:
(4) * He wants us to like himself.
23
SP demands massive retrenchment!
 Retrenchment must eliminate all sentences except those
that LM knows are valid.
 In incremental learning (usually assumed), LM cannot
consult previously heard sentences. Only the current one!
 In P&P framework: SP requires giving up all marked
parameter values except those that LM knows are correct.
 Unless LM learns only from fully unambiguous triggers
(unrealistic?), it can’t be sure that any previously set
parameters were set correctly. So give up all that aren’t
entailed by the current input – set them back to defaults.
 In real-life, retrenchment would be extreme. Child hears
“It’s bedtime.” Must discard all knowledge previously
acquired except if entailed by that sentence plus UG.
24
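
A minimal Python sketch of the retrenchment regime just described, under the slides’ assumptions (incremental learning, no memory of past input). The helpers licenses and entailed_marked_values are hypothetical stand-ins for the parser and for the ‘entailed by the current sentence plus UG’ computation.

def process(grammar, sentence, licenses, entailed_marked_values):
    # grammar: dict mapping parameter name -> 0 (default) or 1 (marked).
    if licenses(grammar, sentence):
        return grammar                          # current hypothesis still fits
    # Falsified: keep only the marked values this one sentence (plus UG) entails;
    # reset every other parameter to its default. This is SP-driven retrenchment.
    keep = entailed_marked_values(sentence)
    return {p: (1 if p in keep else 0) for p in grammar}

On this regime, a sentence that entails almost nothing (like “It’s bedtime”) resets nearly every marked value – which is exactly the problem taken up on the next slide.
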
SP demands excessive retrenchment
 Child hears “It’s bedtime” and discards topicalization,
wh-movement, aux-inversion, etc.
 Could re-learn them. But retrenchment applies again
and again. No knowledge could be retained for long.
 Without SP → overgeneration. With SP → massive undergeneration, if incremental.
 The target grammar could be acquired only if all its
parameters were set by a single input sentence!
 This disastrous interaction between SP and incremental
learning wasn’t noticed prior to Fodor & Sakas (2005).
 Most discussions of SP tacitly assume that LM has
unlimited access to prior input.
25
How could SP be reconciled with incrementality?
 How can retrenchment be tamed? By adding memory,
of one kind or another.
 Memory for previous inputs? Too many? But maybe ok to
store only inputs that triggered a parameter change.
 Problem: Now, LM has to fit parameter values to a
collection of accumulated input sentences. Could that
be done by triggering??
 Instead: Memory for previously disconfirmed grammars?
Then SP demands only that LM adopt a ‘smallest language’
which fits the current input sentence and which
hasn’t yet been disconfirmed. The ‘smallest’ would get larger
and larger as learning proceeds → less retrenching.
 Problem: How to mentally represent which grammars
have been disconfirmed? A huge list? A point on an
enumeration? – but Pinker’s objection!
26
A lattice representing all s-s relations
 This (as shown) is less than 1% of the total lattice for the
3,072 languages in the CoLAG domain, but it is only 7 levels deep.
[Lattice diagram (Fodor, Sakas & Hoskey 2007): supersets at the top, subsets at the bottom.]
 LM must select a grammar from the lowest level.
 When a grammar is disconfirmed, delete it.
 Lattice shrinks from the bottom up → larger languages.
27
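
A rough Python sketch of the lattice idea (a toy rendering, not the Fodor, Sakas & Hoskey implementation): keep the not-yet-disconfirmed grammars, each paired with a toy language represented as a set of sentences; on each input, either retain the current grammar or delete it and re-select an SP-safe choice, i.e. a surviving grammar that fits the input and has no surviving proper subset that also fits it.

def sp_choices(candidates, sentence):
    # Surviving grammars that license the sentence and have no surviving proper
    # subset that also licenses it: the 'lowest level' relevant to this input.
    fitting = {g: L for g, L in candidates.items() if sentence in L}
    return [g for g, L in fitting.items()
            if not any(M < L for M in fitting.values())]

def learn_step(candidates, current, sentence):
    # candidates: dict grammar-name -> set of sentences (a toy 'language').
    if current is not None and sentence in candidates.get(current, set()):
        return candidates, current               # current grammar still fits
    survivors = {g: L for g, L in candidates.items() if g != current}   # delete disconfirmed grammar
    choices = sp_choices(survivors, sentence)
    if not choices:
        raise ValueError("no remaining grammar licenses the input")
    return survivors, choices[0]

Because disconfirmed grammars are deleted outright, the minimal candidates grow larger as learning proceeds, which is why retrenchment diminishes on this model.
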
The lattice model would work, but….
 But, a very ad hoc solution. Why would this huge data
structure be part of a child’s innate endowment?
 Worse still: It codes relations between E-languages,
whereas we would like to assume that the human mind
stores only I-language facts.
 The only alternative to adding memory is:
Trust only unambiguous input.
 SP does not demand any retrenchment if LM knows
that all the marked parameter values it has adopted so
far are correct – because they were adopted in
response to unambiguous triggers.
 We’ll explore this solution next time (Class 6).
28
Writing assignment
 A young child acquires the syntax of Russian in 5 or 6
years. Many linguists have worked for many decades
(centuries!) to discover the syntactic properties of Russian.
 Are children smarter than linguists? If not, what else might
explain this discrepancy?
 In what relevant respects do linguists have stronger
resources than children? In what respects do children
have stronger resources?
 2 or 3 pages, please, to hand in at Monday’s class.
29