Transcript Document

Statistical learning, cross-constraints, and the acquisition of
speech categories:
a computational approach.
Joseph Toscano & Bob McMurray
Psychology Department
University of Iowa
Acknowledgements
• Acknowledgements:
– Dick Aslin
– The MACLab
Learning phonetic categories
• Infants are initially able to discriminate many
different phonetic contrasts.
• They must learn which ones are relevant to their
native language.
• This is accomplished within the first year of life,
and infants quickly adopt the categories present in
their language (Werker & Tees, 1984).
Learning phonetic categories
• What is needed for statistical learning?
• A signal and a mechanism
– Availability of statistics (signal)
– Sensitivity to statistics (mechanism)
• continuous sensitivity to VOT
• ability to track frequencies and build clusters
Statistics in the signal
• What statistical information is available?
• Lisker & Abramson (1964) conducted a cross-language
analysis of speech
– Measured voice-onset time (VOT) from several
speakers in different languages
Statistics in the signal
• The statistics are available in the signal
[Figure: VOT distributions for Tamil, Cantonese, and English]
Sensitivity to statistics
• Are infants sensitive to statistics in speech?
– Maye et al. (2002) tested this by exposing two groups of
infants to different distributions of speech sounds
• Infants are sensitive to within-category detail
(McMurray & Aslin, 2005)
Learning phonetic categories
• Infants can obtain phoneme categories from
exposure to tokens in the speech signal
[Figure: frequency distribution of VOTs, with a +voice category near 0 ms and a -voice category near 50 ms]
Statistical Learning Model
• Statistical learning in a computational model
• What do we need the model to do?
– Show learnability: are statistics sufficient?
– Developmental timecourse
– Implications for speech in general
– Can the model explain more than category learning?
Statistical Learning Model
• Clusters of VOTs are Gaussian distributions
[Figure: Gaussian clusters fit to the VOT distributions for Tamil, Cantonese, and English]
Statistical Learning Model
• Gaussians are defined by three parameters:
– μ: the center of the distribution
– σ: the spread of the distribution
– Φ: the height of the distribution, reflecting the probability of a particular value
[Figure: a single Gaussian over VOT, with its height labeled Φ]
• Each phoneme category can be represented by
these three parameters
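A category under this scheme can be written as the triple (μ, σ, Φ). Below is a minimal Python sketch of that idea; the function name and the treatment of Φ as a simple weight on the normal density are illustrative assumptions, not the authors' implementation.

    import math

    def category_activation(vot, mu, sigma, phi):
        """Height-scaled Gaussian activation for a VOT value.
        mu: category center; sigma: spread; phi: height, treated here
        (an assumption) as a weight on the normal density."""
        density = math.exp(-((vot - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        return phi * density

    # Example: a long-lag (English /p/-like) category centered near 50 ms VOT
    print(category_activation(45.0, mu=50.0, sigma=10.0, phi=0.5))

Activation is highest for VOTs near the category center and falls off with distance, scaled by σ and Φ.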
Statistical Learning Model
• Modeling approach: mixture of Gaussians
[Figure: two Gaussian categories, /b/ and /p/, plotted as category mapping strength (posterior) against the phonetic dimension (e.g., VOT), roughly -80 to +100 ms]
Statistical Learning Model
• Gaussian distributions represent the probability of
occurrence of a particular feature (e.g. VOT)
• Start with a large number of Gaussians to reflect
many different values for the feature.
[Figure: category mapping strength (posterior) over the phonetic dimension (e.g., VOT), with the /b/ and /p/ categories marked]
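A rough sketch of that starting configuration, assuming equal initial heights and an arbitrary initial spread (the specific values here are illustrative, not the model's published settings):

    import random

    def init_mixture(n_categories=20, vot_min=-100.0, vot_max=100.0):
        """Scatter many Gaussians across the feature space with equal heights."""
        return [{"mu": random.uniform(vot_min, vot_max),  # random center within the range
                 "sigma": 10.0,                           # assumed initial spread
                 "phi": 1.0 / n_categories}               # equal initial heights
                for _ in range(n_categories)]

    mixture = init_mixture()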
Statistical Learning Model
• Learning occurs via gradient descent
– Take a single data point as input
– Adjust the location and width of the distribution by a
certain amount, defined by a learning rule
[Figure: a single Gaussian over the phonetic dimension (e.g., VOT); for each data point, the center of the distribution is moved closer to the data point and the distribution is made wider to accommodate it. Y-axis: category mapping strength (posterior)]
Statistical Learning Model
• Learning rule:
Probability of a particular point = (proportion of the space under that Gaussian) × (equation of a Gaussian)
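In symbols, the rule amounts to p(x | category i) = Φ_i · G(x; μ_i, σ_i), where G is the Gaussian equation, and learning nudges μ, σ, and Φ after each data point. The sketch below shows one way such a gradient-style step could look; the learning rate and the exact update terms are assumptions for illustration, not the published equations.

    import math

    def gaussian_likelihood(x, c):
        """Phi-weighted Gaussian density at x for a category c = {mu, sigma, phi}."""
        g = math.exp(-((x - c["mu"]) ** 2) / (2 * c["sigma"] ** 2)) / (c["sigma"] * math.sqrt(2 * math.pi))
        return c["phi"] * g

    def update(categories, x, lr=0.01):
        """One training step on a single VOT value x: each Gaussian moves in
        proportion to its posterior responsibility for x (assumed update forms)."""
        total = sum(gaussian_likelihood(x, c) for c in categories)
        for c in categories:
            resp = gaussian_likelihood(x, c) / total   # posterior responsibility for x
            diff = x - c["mu"]
            c["mu"] += lr * resp * diff                # move the center toward the data point
            c["sigma"] += lr * resp * (diff ** 2 - c["sigma"] ** 2) / c["sigma"]  # widen or narrow the spread
            c["phi"] += lr * (resp - c["phi"])         # height drifts toward average responsibility

    # Example: two categories and one short-lag token
    cats = [{"mu": -20.0, "sigma": 10.0, "phi": 0.5}, {"mu": 40.0, "sigma": 10.0, "phi": 0.5}]
    update(cats, 15.0)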
Can the model learn?
• Can the model learn speech categories?
Can the model learn?
• The model in action
• Fails to learn the correct number of categories
– Too many distributions under each curve
– Is this a problem? Maybe.
• Solution: Introduce competition
• Competition through winner-take-all strategy
– Only the closest matching Gaussian is adjusted
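A hedged sketch of how that winner-take-all competition could be added to the update (an assumed formulation: only the Gaussian with the highest weighted likelihood for the token is moved):

    import math

    def gaussian_likelihood(x, c):
        g = math.exp(-((x - c["mu"]) ** 2) / (2 * c["sigma"] ** 2)) / (c["sigma"] * math.sqrt(2 * math.pi))
        return c["phi"] * g

    def update_winner_take_all(categories, x, lr=0.01):
        """Only the best-matching Gaussian is adjusted; all others are left alone."""
        winner = max(categories, key=lambda c: gaussian_likelihood(x, c))
        diff = x - winner["mu"]
        winner["mu"] += lr * diff                                         # pull the center toward x
        winner["sigma"] += lr * (diff ** 2 - winner["sigma"] ** 2) / winner["sigma"]  # adjust the spread
        winner["phi"] += lr * (1.0 - winner["phi"])                       # assumed: the winner's height grows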
Does learning need to be constrained?
• Can the model learn speech categories?
Yes.
• Does learning need to be constrained?
Does learning need to be constrained?
• Unconstrained feature space
– Starting VOTs distributed from -1000 to +1000 ms
– Model fails to learn
– Similar to a situation in which the model has too few
starting distributions
Does learning need to be constrained?
• Constrained feature space
– Starting VOTs distributed from -100 to +100 ms
– Within the range of actual voice onset times used in
language.
Are constraints linguistic?
• Can the model learn speech categories?
Yes.
• Does learning need to be constrained?
Yes.
• Do constraints need to be linguistic?
Are constraints linguistic?
• Cross-linguistic constraints
– Combined data from the languages used in Lisker &
Abramson (1964) and several other languages
Are constraints linguistic?
• VOTs from:
– English
– Thai
– Spanish
– Cantonese
– Korean
– Navajo
– Dutch
– Hungarian
– Tamil
– Eastern Armenian
– Hindi
– Marathi
– French
• Test the model with two different sets of starting states:
– Cross-linguistic: based on the distribution of VOTs across languages
– Random normally distributed: centered around 0 ms, range ~ -100 ms to +100 ms
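The two starting-state sets could be built roughly as follows; the cross-linguistic modes here are placeholder values standing in for the pooled cross-language VOT data, and the random-normal spread is an assumption.

    import random

    def random_normal_start(n=20):
        """Centers drawn from a normal around 0 ms, kept within roughly -100..+100 ms."""
        return [{"mu": max(-100.0, min(100.0, random.gauss(0.0, 40.0))),
                 "sigma": 10.0, "phi": 1.0 / n} for _ in range(n)]

    def cross_linguistic_start(n=20):
        """Centers sampled near VOT modes that recur across languages
        (placeholder modes standing in for the pooled cross-language data)."""
        modes = [-90.0, 0.0, 60.0]   # assumed prevoiced / short-lag / long-lag modes
        return [{"mu": random.gauss(random.choice(modes), 15.0),
                 "sigma": 10.0, "phi": 1.0 / n} for _ in range(n)]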
Are linguistic constraints helpful?
• Can the model learn speech categories?
Yes.
• Does learning need to be constrained?
Yes.
• Do constraints need to be linguistic?
No.
• Do cross-language constraints help?
Are linguistic constraints helpful?
• This is the part of the talk that I don’t have any
slides for yet.
What do infants do?
• Can the model learn speech categories?
Yes.
• Does learning need to be constrained?
Yes.
• Do constraints need to be linguistic?
No.
• Do cross-language constraints help? Sometimes.
• What do infants do?
What do infants do?
• As infants get older, their ability to discriminate
different VOT contrasts decreases.
– Initially able to discriminate many contrasts
– Eventually discriminate only those of their native
language
What do infants do?
• Each model’s discrimination over time
– Random normal: decreases
– Cross-linguistic: slight increase
[Figure: discrimination scores over training (x-axis 0 to 12,000) for the cross-linguistic and random-normal models, each with a linear trend line; y-axis roughly 0.2 to 0.7]
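One plausible way to operationalize the model's discrimination of two VOT values (an assumption for illustration, not necessarily the measure behind the figure) is to ask whether they fall under different best-matching Gaussians:

    import math

    def gaussian_likelihood(x, c):
        g = math.exp(-((x - c["mu"]) ** 2) / (2 * c["sigma"] ** 2)) / (c["sigma"] * math.sqrt(2 * math.pi))
        return c["phi"] * g

    def best_category(x, categories):
        """Index of the Gaussian with the highest weighted likelihood for x."""
        return max(range(len(categories)), key=lambda i: gaussian_likelihood(x, categories[i]))

    def discriminates(x1, x2, categories):
        """True if the two VOT values map onto different best-matching Gaussians."""
        return best_category(x1, categories) != best_category(x2, categories)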
What do infants do?
• Cross-linguistic starting states lead to faster
category acquisition
• Why wouldn’t infants take advantage of this?
– Too great a risk of over-generalization
– Better to take more time to do the job right than to do
it too quickly