Transcript Document
Statistical learning, cross-constraints, and the acquisition
of speech categories: a computational approach
Joseph Toscano & Bob McMurray
Psychology Department
University of Iowa
Acknowledgements
– Dick Aslin
– The MACLab
Learning phonetic categories
• Infants are initially able to discriminate many
different phonetic contrasts.
• They must learn which ones are relevant to their
native language.
• This is accomplished within the first year of life,
and infants quickly adopt the categories present in
their language (Werker & Tees, 1984).
Learning phonetic categories
• What is needed for statistical learning?
• A signal and a mechanism
– Availability of statistics (signal)
– Sensitivity to statistics (mechanism)
• continuous sensitivity to VOT
• ability to track frequencies and build clusters
Statistics in the signal
• What statistical information is available?
• Lisker & Abramson (1964) performed a cross-language
analysis of speech
– Measured voice-onset time (VOT) from several
speakers in different languages
Statistics in the signal
• The statistics are available in the signal
[Figure: VOT distributions for Tamil, Cantonese, and English]
Sensitivity to statistics
• Are infants sensitive to statistics in speech?
– Maye et al. (2002) asked this question
– Two groups of infants were familiarized with either a bimodal or a
unimodal distribution of tokens along a voicing continuum; only the
bimodal group later discriminated the endpoint tokens
• Infants are sensitive to within-category detail
(McMurray & Aslin, 2005)
Learning phonetic categories
• Infants can obtain phoneme categories from
exposure to tokens in the speech signal
[Figure: frequency of tokens along the VOT dimension, with a +voice
cluster near 0 ms and a -voice cluster near 50 ms]
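To make the input to such a learner concrete, here is a minimal Python sketch (not from the talk) that generates synthetic VOT tokens from two clusters like the ones in the figure: a +voice cluster near 0 ms and a -voice cluster near 50 ms. The cluster means, standard deviations, and sample size are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_vot_tokens(n_tokens=1000, p_voiced=0.5):
        # Draw each token from one of two clusters: +voice near 0 ms,
        # -voice near 50 ms (means/SDs are illustrative, not from the talk).
        voiced = rng.random(n_tokens) < p_voiced
        return np.where(voiced,
                        rng.normal(loc=0.0, scale=5.0, size=n_tokens),
                        rng.normal(loc=50.0, scale=10.0, size=n_tokens))

    tokens = sample_vot_tokens()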
Statistical Learning Model
• Statistical learning in a computational model
• What do we need the model to do:
– Show learnability. Are statistics sufficient?
– Developmental timecourse.
– Implications for speech in general.
– Can the model explain more than category learning?
Statistical Learning Model
• Clusters of VOTs are Gaussian distributions
[Figure: Gaussian-shaped clusters of VOTs for Tamil, Cantonese, and English]
Statistical Learning Model
• Gaussians defined by three parameters:
– μ: the center of the distribution
– σ: the spread of the distribution
– Φ: the height of the distribution, reflecting the
probability of a particular value
[Figure: a Gaussian over VOT, with Φ marking its height]
• Each phoneme category can be represented by
these three parameters
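As a concrete (hypothetical) sketch of this representation, a single category can be stored as the triple (μ, σ, Φ), with its contribution to the probability of a VOT value given by Φ times a Gaussian density. The names and code organization are mine, not the model's actual implementation.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class GaussianCategory:
        mu: float     # center of the distribution (VOT in ms)
        sigma: float  # spread of the distribution
        phi: float    # height / mixing weight of the distribution

        def density(self, x):
            # Phi-scaled Gaussian: phi * N(x; mu, sigma)
            return self.phi * np.exp(-0.5 * ((x - self.mu) / self.sigma) ** 2) / (
                self.sigma * np.sqrt(2 * np.pi))

    # e.g. a voiced English-like category:
    b_category = GaussianCategory(mu=0.0, sigma=5.0, phi=0.5)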
Statistical Learning Model
• Modeling approach: mixture of Gaussians
[Figure: two Gaussians labeled /b/ and /p/ along a phonetic dimension
(e.g. VOT, -80 to 100 ms); y-axis: category mapping strength (posterior)]
Statistical Learning Model
• Gaussian distributions represent the probability of
occurrence of a particular feature (e.g. VOT)
• Start with a large number of Gaussians to reflect
many different values for the feature.
[Figure: a large number of starting Gaussians spanning the phonetic
dimension (e.g. VOT), with /b/ and /p/ clusters; y-axis: category
mapping strength (posterior)]
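Continuing the GaussianCategory sketch above, one way to set up this starting state is to spread many Gaussians evenly across the feature dimension. The number of starting categories, their initial widths, and the even spacing are assumptions; the slides only say the model starts with a large number of Gaussians.

    import numpy as np

    def init_mixture(n_categories=50, vot_min=-100.0, vot_max=100.0, sigma0=10.0):
        # Evenly spaced starting Gaussians with equal mixing weights.
        mus = np.linspace(vot_min, vot_max, n_categories)
        return [GaussianCategory(mu=float(m), sigma=sigma0, phi=1.0 / n_categories)
                for m in mus]

    mixture = init_mixture()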
Statistical Learning Model
• Learning occurs via gradient descent
– Take a single data point as input
– Adjust the location and width of the distribution by a
certain amount, defined by a learning rule
[Figure: one Gaussian and a single data point along the phonetic
dimension (e.g. VOT); annotations: "Move the center of the dist closer
to the data point" and "Make the dist wider to accommodate the data
point"; y-axis: category mapping strength (posterior)]
Statistical Learning Model
• Learning rule:
probability of a particular point = (proportion of space under that
Gaussian) × (equation of a Gaussian)
i.e., for category i, p(x, i) = Φ_i · N(x; μ_i, σ_i)
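A hedged sketch of how this rule could drive the gradient descent described above (continuing the earlier code): for an input token x, each category's likelihood is Φ_i · N(x; μ_i, σ_i), and each category's center and width are nudged toward x in proportion to its share of that likelihood. The gradient expressions and learning rate here are the standard online mixture-of-Gaussians updates, not necessarily the exact rule used in the talk.

    import numpy as np

    def gaussian(x, mu, sigma):
        # N(x; mu, sigma)
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    def update(mixture, x, lr=0.01):
        # Likelihood of x under each category: phi_i * N(x; mu_i, sigma_i)
        likes = np.array([c.phi * gaussian(x, c.mu, c.sigma) for c in mixture])
        resp = likes / likes.sum()  # each category's share of the likelihood
        for r, c in zip(resp, mixture):
            d = x - c.mu
            # Move the center of the distribution closer to the data point
            c.mu += lr * r * d / c.sigma ** 2
            # Widen (or narrow) the distribution to accommodate the data point
            c.sigma += lr * r * (d ** 2 - c.sigma ** 2) / c.sigma ** 3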
Can the model learn?
• Can the model learn speech categories?
Can the model learn?
• The model in action
• Fails to learn correct number of categories
– Too many distributions under each curve
– Is this a problem? Maybe.
• Solution: Introduce competition
• Competition through winner-take-all strategy
– Only the closest matching Gaussian is adjusted
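A sketch of the winner-take-all variant, reusing the helpers above: rather than sharing the update across all Gaussians, only the closest-matching one is adjusted. Again, the learning rate and exact update form are assumptions.

    import numpy as np

    def update_winner_take_all(mixture, x, lr=0.01):
        # Competition: find the single best-matching Gaussian for x ...
        likes = [c.phi * gaussian(x, c.mu, c.sigma) for c in mixture]
        winner = mixture[int(np.argmax(likes))]
        # ... and adjust only that Gaussian.
        d = x - winner.mu
        winner.mu += lr * d / winner.sigma ** 2
        winner.sigma += lr * (d ** 2 - winner.sigma ** 2) / winner.sigma ** 3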
Does learning need to be constrained?
• Can the model learn speech categories?
• Does learning need to be constrained?
Yes.
Does learning need to be constrained?
• Unconstrained feature space
– Starting VOTs distributed from -1000 to +1000 ms
– Model fails to learn
– Similar to a situation in which the model has too few
starting distributions
Does learning need to be constrained?
• Constrained feature space
– Starting VOTs distributed from -100 to +100 ms
– Within the range of actual voice onset times used in
language.
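In terms of the init_mixture sketch above, the two conditions differ only in the range over which the starting Gaussians are spread (the ranges are the ones given in the slides):

    # Unconstrained feature space: starting VOTs spread from -1000 to +1000 ms
    # (the condition in which the model fails to learn).
    unconstrained = init_mixture(vot_min=-1000.0, vot_max=1000.0)

    # Constrained feature space: starting VOTs from -100 to +100 ms, within
    # the range of voice onset times actually used in languages.
    constrained = init_mixture(vot_min=-100.0, vot_max=100.0)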
Are constraints linguistic?
• Can the model learn speech categories?
Yes.
• Does learning need to be constrained?
Yes.
• Do constraints need to be linguistic?
Are constraints linguistic?
• Cross-linguistic constraints
– Combined data from languages used in Lisker &
Abramson, 1964, and several other languages
Are constraints linguistic?
• VOTs from:
– English
– Thai
– Spanish
– Cantonese
– Korean
– Navajo
– Dutch
– Hungarian
– Tamil
– Eastern Armenian
– Hindi
– Marathi
– French
• Test the model with two different sets of
starting states:
– Cross-linguistic: based on the distribution of VOTs
across languages
– Random normally distributed: centered around 0 ms,
range ~ -100 ms to +100 ms
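A sketch of the two starting-state schemes, continuing the earlier code. The cross-linguistic condition is described only as "based on the distribution of VOTs across languages"; here it is approximated by drawing starting centers from a pooled array of cross-language VOT measurements (pooled_vots is hypothetical), while the random-normal condition draws centers from a normal distribution around 0 ms, kept within roughly -100 to +100 ms. The spread and category count are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)

    def init_random_normal(n_categories=50, sigma0=10.0):
        # Starting centers normally distributed around 0 ms, clipped to ~[-100, 100] ms.
        mus = np.clip(rng.normal(loc=0.0, scale=40.0, size=n_categories), -100, 100)
        return [GaussianCategory(mu=float(m), sigma=sigma0, phi=1.0 / n_categories)
                for m in mus]

    def init_cross_linguistic(pooled_vots, n_categories=50, sigma0=10.0):
        # Starting centers drawn from pooled VOT measurements across the
        # languages listed above (pooled_vots: hypothetical array of those values).
        mus = rng.choice(pooled_vots, size=n_categories)
        return [GaussianCategory(mu=float(m), sigma=sigma0, phi=1.0 / n_categories)
                for m in mus]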
Are linguistic constraints helpful?
• Can the model learn speech categories?
Yes.
• Does learning need to be constrained?
Yes.
• Do constraints need to be linguistic?
No.
• Do cross-language constraints help?
Are linguistic constraints helpful?
• This is the part of the talk that I don’t have any
slides for yet.
What do infants do?
• Can the model learn speech categories?
Yes.
• Does learning need to be constrained?
Yes.
• Do constraints need to be linguistic?
No.
• Do cross-language constraints help? Sometimes.
• What do infants do?
What do infants do?
• As infants get older, their ability to discriminate
different VOT contrasts decreases.
– Initially able to discriminate many contrasts
– Eventually discriminate only those of their native
language
What do infants do?
• Each model’s discrimination over time
– Random normal: decreases
– Cross-linguistic: slight increase
[Figure: discrimination over training (x-axis 0 to 12000) for the
cross-linguistic and random-normal models, with linear trend lines]
What do infants do?
• Cross-linguistic starting states lead to faster
category acquisition
• Why wouldn’t infants take advantage of this?
– Too great a risk of over-generalization
– Better to take more time to do the job right than to do
it too quickly