
A Bayesian approach to word segmentation: Theoretical and experimental results
Sharon Goldwater
Department of Linguistics, Stanford University
Word segmentation

- One of the first problems infants must solve when learning language.
- Infants make use of many different cues:
  - Phonotactics, allophonic variation, metrical (stress) patterns, effects of coarticulation, and statistical regularities in syllable sequences.
- Statistics may provide initial bootstrapping:
  - Used very early (Thiessen & Saffran, 2003).
  - Language-independent.
Modeling statistical segmentation

- Previous work often focuses on how statistical information (e.g., transitional probabilities) can be used to segment speech.
- The Bayesian approach asks what information should be used by a successful learner:
  - What statistics should be collected?
  - What assumptions (by the learner) constrain possible generalizations?
Outline

1. Computational model and theoretical results
   - What are the consequences of using different sorts of information for optimal word segmentation?
   - (joint work with Tom Griffiths and Mark Johnson)
2. Modeling experimental data
   - Do humans behave optimally?
   - (joint work with Mike Frank, Vikash Mansinghka, Tom Griffiths, and Josh Tenenbaum)
Statistical segmentation

- Work on statistical segmentation often discusses transitional probabilities (Saffran et al., 1996; Aslin et al., 1998; Johnson & Jusczyk, 2001).
  - P(syl_i | syl_{i-1}) is often lower at word boundaries.
- What do TPs have to say about words?
  - A word is a unit whose beginning predicts its end, but it does not predict other words.
  - Or... a word is a unit whose beginning predicts its end, and it also predicts future words.
Interpretation of TPs

- Most previous work assumes words are statistically independent.
  - Experimental work: Saffran et al. (1996), many others.
  - Computational work: Brent (1999).
- Example lexicon: tupiro golabu bidaku padoti
  Example stream: golabubidakugolabutupiropadotibidakupadotitupi...
- What about words predicting other words?
Questions

- If a learner assumes that words are independent units, what is learned (from more realistic input)?
  - Unigram model: Generate each word independently.
- What if the learner assumes that words are units that help predict other units?
  - Bigram model: Generate each word conditioned on the previous word.
- Approach: use a Bayesian ideal observer model to examine the consequences of each assumption.
Bayesian learning

- The Bayesian learner seeks to identify an explanatory linguistic hypothesis that
  - accounts for the observed data, and
  - conforms to prior expectations.
- Focus is on the goal of computation, not the procedure (algorithm) used to achieve the goal.
Bayesian segmentation

- In the domain of segmentation, we have:
  - Data: unsegmented corpus (transcriptions).
  - Hypotheses: sequences of word tokens.
- Likelihood P(d | h) = 1 if concatenating the hypothesized words forms the corpus, 0 otherwise.
- Prior P(h) encodes the unigram or bigram assumption (also others).
- Since the likelihood is all-or-nothing, the optimal solution is the consistent segmentation with the highest prior probability. (A toy scoring sketch follows.)
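To make the "0/1 likelihood plus prior" idea concrete, here is a minimal Python sketch that enumerates segmentations of a short input and scores them. The toy lexicon and unknown-word penalty are invented for illustration; the actual model uses the Dirichlet process prior described in the following slides.

```python
# A minimal sketch of the "0/1 likelihood + prior" formulation, not the actual
# GGJ prior: the toy lexicon and unknown-word penalty below are invented for
# illustration; the real model uses the Dirichlet-process prior shown later.
from math import log

TOY_LEXICON = {"whats": 0.1, "that": 0.1, "the": 0.2, "doggie": 0.1}  # hypothetical
UNKNOWN_BASE = 1e-4  # hypothetical penalty for words not in the toy lexicon


def segmentations(s):
    """Yield every way of splitting string s into contiguous words."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        for rest in segmentations(s[i:]):
            yield [s[:i]] + rest


def toy_log_prior(words):
    """Score a hypothesis: known words are likely, unknown words pay a
    length-dependent penalty (shorter unknown words are cheaper)."""
    return sum(log(TOY_LEXICON.get(w, UNKNOWN_BASE * 0.5 ** len(w))) for w in words)


# Every candidate segmentation already satisfies the 0/1 likelihood (its
# concatenation reproduces the input), so we simply maximize the prior.
print(max(segmentations("thedoggie"), key=toy_log_prior))
# -> ['the', 'doggie']
```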
Brent (1999)

- Describes a Bayesian unigram model for segmentation.
  - Prior favors solutions with fewer words, shorter words.
- Problems with Brent's system:
  - Learning algorithm is approximate (non-optimal).
  - Difficult to extend to incorporate bigram info.
A new unigram model (Dirichlet process)

Assume word w_i is generated as follows:

1. Is w_i a novel lexical item?

   P(yes) = α / (n + α)        P(no) = n / (n + α)

   where n is the number of word tokens generated so far and α is the
   concentration parameter. Fewer word types = higher probability.

2. If novel, generate phonemic form x_1...x_m:

   P(w_i = x_1...x_m) = ∏_{j=1..m} P(x_j)

   Shorter words = higher probability.

   If not novel, choose the lexical identity of w_i from previously occurring words:

   P(w_i = l) = count(l) / n

   Power law = higher probability.

(A generative sketch of this process appears below.)
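The following Python sketch walks through the generative process just described. The phoneme inventory, length (stop) probability, and α value are illustrative choices, not the settings used in the reported simulations.

```python
# A sketch of the Dirichlet-process (CRP-style) word generator described above.
# Phoneme inventory, stop probability, and alpha are illustrative choices,
# not the values used in the actual GGJ experiments.
import random

PHONEMES = list("abdgiklmnoprstuvz")   # hypothetical inventory
P_STOP = 0.3                           # hypothetical stop probability (word length)
ALPHA = 1.0                            # concentration parameter


def generate_novel_word(rng):
    """Step 2: build a phoneme string; shorter words are more probable."""
    word = rng.choice(PHONEMES)
    while rng.random() > P_STOP:
        word += rng.choice(PHONEMES)
    return word


def generate_corpus(num_words, rng=random.Random(0)):
    tokens, counts = [], {}
    for _ in range(num_words):
        n = len(tokens)
        # Step 1: novel lexical item with probability alpha / (n + alpha).
        if rng.random() < ALPHA / (n + ALPHA):
            w = generate_novel_word(rng)
        else:
            # Otherwise reuse an existing word with probability count(l) / n,
            # which produces the rich-get-richer (power-law) behavior.
            w = rng.choices(list(counts), weights=list(counts.values()))[0]
        counts[w] = counts.get(w, 0) + 1
        tokens.append(w)
    return tokens


print(generate_corpus(20))
```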
Unigram model: simulations

- Same corpus as Brent (Bernstein-Ratner, 1987):
  - 9790 utterances of phonemically transcribed child-directed speech (19-23 months).
  - Average utterance length: 3.4 words.
  - Average word length: 2.9 phonemes.
- Example input:
  yuwanttusiD6bUk
  lUkD*z6b7wIThIzh&t
  &nd6dOgi
  yuwanttulUk&tDIs
  ...
Example results
Comparison to previous results

- Proposed boundaries are more accurate than Brent's, but fewer proposals are made:

              Boundary Precision   Boundary Recall
  Brent       .80                  .85
  GGJ         .92                  .62

  Precision = #correct / #found   [= hits / (hits + false alarms)]
  Recall    = #correct / #true    [= hits / (hits + misses)]
- Result: word tokens are less accurate.

              Token F-score
  Brent       .68
  GGJ         .54

  F-score: the harmonic mean of precision and recall. (A small scoring sketch follows.)
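For concreteness, here is a small Python sketch of boundary precision, recall, and F-score as defined above; the gold and predicted segmentations are invented examples rather than corpus data.

```python
# A small sketch of boundary precision, recall, and F-score as defined above,
# computed from gold and predicted segmentations of the same utterance.
# The example utterance is illustrative, not taken from the actual corpus.

def boundary_set(words):
    """Return the set of word-internal boundary positions (character offsets)."""
    positions, offset = set(), 0
    for w in words[:-1]:
        offset += len(w)
        positions.add(offset)
    return positions


def boundary_prf(gold, predicted):
    g, p = boundary_set(gold), boundary_set(predicted)
    hits = len(g & p)
    precision = hits / len(p) if p else 1.0   # #correct / #found
    recall = hits / len(g) if g else 1.0      # #correct / #true
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f


gold = ["whats", "that"]
predicted = ["whatsthat"]          # undersegmented hypothesis: no boundaries found
print(boundary_prf(gold, predicted))
```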
What happened?

- The model assumes (falsely) that words have the same probability regardless of context:

  P(D&t) = .024
  P(D&t | WAts) = .46
  P(D&t | tu) = .0019

- Positing amalgams allows the model to capture word-to-word dependencies.
What about other unigram models?

- Brent's learning algorithm is insufficient to identify the optimal segmentation.
  - Our solution has higher probability under his model than his own solution does.
  - On a randomly permuted corpus, our system achieves 96% accuracy; Brent's gets 81%.
- Formal analysis shows undersegmentation is the optimal solution for any (reasonable) unigram model.
Bigram model (hierarchical Dirichlet process)

Assume word w_i is generated as follows:

1. Is (w_{i-1}, w_i) a novel bigram?

   P(yes) = β / (n_{w_{i-1}} + β)        P(no) = n_{w_{i-1}} / (n_{w_{i-1}} + β)

   where n_{w_{i-1}} is the number of word tokens generated so far following
   w_{i-1} and β is a concentration parameter.

2. If novel, generate w_i using the unigram model (almost).
   If not, choose the lexical identity of w_i from words previously occurring after w_{i-1}:

   P(w_i = l | w_{i-1} = l') = count(l', l) / count(l')

(A generative sketch of this step appears below.)
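A Python sketch of the bigram generation step follows. The backoff distribution here is a fixed toy word list standing in for the unigram Dirichlet process, and β is an illustrative value.

```python
# A sketch of the bigram (hierarchical Dirichlet process) generation step
# described above. The backoff "base" distribution here is just a fixed toy
# lexicon; in the real model it is the unigram DP, and beta is illustrative.
import random
from collections import defaultdict

BETA = 1.0
BASE_WORDS = ["lagi", "dazu", "tigupi", "bavulu"]   # hypothetical backoff lexicon


def generate(num_words, rng=random.Random(0)):
    bigram_counts = defaultdict(lambda: defaultdict(int))
    prev, tokens = "<s>", []
    for _ in range(num_words):
        following = bigram_counts[prev]
        n_prev = sum(following.values())
        # Novel bigram with probability beta / (n_prev + beta) ...
        if rng.random() < BETA / (n_prev + BETA):
            w = rng.choice(BASE_WORDS)   # stand-in for the unigram DP backoff
        else:
            # ... otherwise reuse: P(w | prev) = count(prev, w) / count(prev).
            w = rng.choices(list(following), weights=list(following.values()))[0]
        bigram_counts[prev][w] += 1
        tokens.append(w)
        prev = w
    return tokens


print(" ".join(generate(15)))
```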
Example results
Quantitative evaluation

- Compared to the unigram model, more boundaries are proposed, with no loss in accuracy:

                    Boundary Precision   Boundary Recall
  GGJ (unigram)     .92                  .62
  GGJ (bigram)      .92                  .84

- Accuracy is higher than previous models:

                    Token F-score   Type F-score
  Brent (unigram)   .68             .52
  GGJ (bigram)      .77             .63
Summary

- Two different assumptions about what defines a word are consistent with behavioral evidence.
- Different assumptions lead to different results:
  - Beginning of word predicts end of word: the optimal solution undersegments, finding common multi-word units.
  - Word also predicts next word: segmentation is more accurate, adult-like.
Remaining questions

- Is unigram segmentation sufficient to start bootstrapping other cues (e.g., stress)?
- How prevalent are multi-word chunks in infant vocabulary?
- Are humans able to segment based on bigram statistics?
- Is there any evidence that human performance is consistent with Bayesian predictions?
Testing model predictions

- Goal: compare our model (and others) to human performance in a Saffran-style experiment.

  Lexicon: tupiro golabu bidaku padoti
  Stream: golabubidakugolabutupiropadotibidakupadotitupiro...

- Problem: all models have near-perfect accuracy on experimental stimuli.
- Solution: compare changes in model performance relative to humans as task difficulty is varied.
Experimental method

- Examine segmentation performance under different utterance length conditions.
- Example lexicon: lagi dazu tigupi bavulu kabitudu kipavazi
- Conditions:

  # wds/utt   # utts   tot # wds
  1           1200     1200
  2            600     1200
  4            300     1200
  6            200     1200
  8            150     1200
  12           100     1200
  24            50     1200
Procedure

- Training: adult subjects listened to synthesized utterances in one length condition.
  - No pauses between syllables within utterances.
  - 500 ms pauses between utterances.
- Testing: 2AFC between words and part-word distractors.

  Lexicon: lagi dazu tigupi bavulu kabitudu kipavazi

  lagitigupibavulukabitudulagikipavazi
  dazukipavazibavululagitigupikabitudu
  kipavazitigupidazukabitudulagitigupi
  ...

(A sketch of stimulus generation for these length conditions appears below.)
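The sketch below shows how training streams for the length conditions in the table above could be assembled; the uniform sampling of words is an assumption for illustration, not the published randomization procedure.

```python
# A sketch of generating training stimuli for the length conditions above:
# words drawn from the lexicon and concatenated without pauses within an
# utterance. Sampling details (uniform choice, etc.) are assumptions.
import random

LEXICON = ["lagi", "dazu", "tigupi", "bavulu", "kabitudu", "kipavazi"]


def make_condition(words_per_utt, total_words=1200, rng=random.Random(0)):
    utterances = []
    for _ in range(total_words // words_per_utt):
        utt = [rng.choice(LEXICON) for _ in range(words_per_utt)]
        utterances.append("".join(utt))   # no pauses between syllables
    return utterances


for wpu in (1, 2, 4, 6, 8, 12, 24):
    utts = make_condition(wpu)
    print(f"{wpu:>2} wds/utt -> {len(utts)} utterances, e.g. {utts[0]}")
```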
Human performance
Model comparison

- Evaluated six different models.
- Each model trained and tested on the same stimuli as humans.
- To simulate 2AFC, produce a score s(w) for each word in the choice pair and use the Luce choice rule:

  P(w1) = s(w1) / (s(w1) + s(w2))

- Compute the best linear fit of each model to the human data, then calculate the correlation.

(A small 2AFC simulation sketch appears below.)
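A small Python sketch of simulating the 2AFC test with the Luce choice rule above. The frequency-based scorer is a stand-in invented for illustration; it is not one of the six models evaluated.

```python
# A sketch of simulating the 2AFC test with the Luce choice rule above.
# Scores come from a simple smoothed unigram frequency model over a toy
# segmented corpus; this is an illustration, not one of the six models.
from collections import Counter


def luce_choice(s1, s2):
    """Probability of choosing word 1 over word 2 given scores s1, s2."""
    return s1 / (s1 + s2)


def make_scorer(segmented_corpus):
    """Hypothetical scoring function: smoothed relative frequency of a word."""
    counts = Counter(segmented_corpus)
    total = sum(counts.values())
    return lambda w: (counts[w] + 0.5) / (total + 0.5)   # smoothed to avoid 0


scorer = make_scorer(["lagi", "dazu", "lagi", "tigupi", "lagi", "bavulu"])
word, part_word = "lagi", "gidazu"   # part-word distractor spans a boundary
print(luce_choice(scorer(word), scorer(part_word)))
```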
Models used

- Three local statistic models, all similar to transitional probabilities (TP):
  - Segment at minima of P(syl_i | syl_{i-1}).
  - s(w) = minimum TP in w.
- Swingley (2005):
  - Builds a lexicon using local statistic and frequency thresholds.
  - s(w) = max threshold at which w appears in the lexicon.
- PARSER (Perruchet and Vinter, 1998):
  - Incorporates principles of lexical competition and memory decay.
  - s(w) = P(w) as defined by the model.
- GGJ (our Bayesian model):
  - s(w) = P(w) as defined by the model.
Results: linear fit
Results: words vs. part-words
Summary

- Statistical segmentation is more difficult when utterances contain more words.
- The gradual decay in performance is predicted by the Bayesian model, but not by the others tested.
- Bayes predicts that the difficulty is primarily due to effects of competition:
  - In longer utterances, correct words are less probable because more other possibilities exist.
  - Local statistic approaches don't model competition.
Continuing work

Experiments with other task modifications will further test our model's predictions.

- Vary the length of exposure to the training stimulus:
  - Bayes: longer exposure => better performance.
  - TPs: no effect of exposure.
- Vary the number of lexical items:
  - Bayes: larger lexicon => worse performance.
  - TPs: larger lexicon => better performance.
Conclusions

Computer simulations and experimental work suggest that:
- The unigram assumption causes ideal learners to undersegment fluent speech.
- Human word segmentation may approximate Bayesian ideal learning.
Bayesian segmentation

Input data:
  whatsthat
  thedoggie
  wheresthedoggie
  ...

Some hypotheses:

  whatsthat
  thedoggie
  wheresthedoggie

  whats that
  the doggie
  wheres the doggie

  wh at sth at
  thedo ggie
  wh eres thedo ggie

  w h a t s t h a t
  t h e d o g g i e
  w h e r e s t h e d o g g i e
Search algorithm

- The model defines a distribution over hypotheses. We use Gibbs sampling to find a good hypothesis.
- The iterative procedure produces samples from the posterior distribution P(h|d) over hypotheses h.
- A batch algorithm (but online algorithms are possible, e.g., particle filtering).
Gibbs sampler

1. Consider two hypotheses differing by a single word boundary:

   whats.that            whats.that
   the.doggie            the.dog.gie
   wheres.the.doggie     wheres.the.doggie

2. Calculate the probabilities of the words that differ, given the current analysis of all other words.
   - The model is exchangeable: the probability of a set of outcomes does not depend on their ordering.
3. Sample one of the two hypotheses according to the ratio of probabilities.

(A sketch of one boundary-sampling step appears below.)
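The Python sketch below illustrates a single boundary-sampling step. The smoothed unigram scorer stands in for the Dirichlet-process predictive probabilities used by the real sampler, and the corpus and boundary site are invented.

```python
# A sketch of one Gibbs sampling step over a single potential boundary.
# The scoring function is a simple smoothed unigram model over the current
# analysis, standing in for the DP predictive probabilities of the real
# sampler; the corpus and boundary position are illustrative.
import random
from collections import Counter
from math import exp, log


def log_score(words):
    """Hypothetical log-probability of the words at the sampled site, given
    counts from the rest of the corpus (add-0.5 smoothing)."""
    counts = Counter(OTHER_WORDS)
    total = sum(counts.values())
    vocab = len(counts) + 1
    return sum(log((counts[w] + 0.5) / (total + 0.5 * vocab)) for w in words)


# Current analysis of all *other* words (held fixed during this step).
OTHER_WORDS = ["whats", "that", "wheres", "the", "doggie"]

# Two hypotheses differing by one boundary in "thedoggie".
h_no_boundary = ["thedoggie"]
h_boundary = ["the", "doggie"]

p1, p2 = exp(log_score(h_no_boundary)), exp(log_score(h_boundary))
rng = random.Random(0)
chosen = h_boundary if rng.random() < p2 / (p1 + p2) else h_no_boundary
print(f"P(boundary) = {p2 / (p1 + p2):.3f}, sampled: {chosen}")
```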
Models used

- Transitional probabilities (TP):
  - Segment at minima of P(syl_i | syl_{i-1}).
  - s(w) = minimum TP in w. (Equivalently, use the product.)
- Smoothed transitional probabilities:
  - Avoid 0 counts by using add-λ smoothing.
- Mutual information (MI):
  - Segment where MI between syllables is lowest.
  - s(w) = minimum MI in w. (Equivalently, use the sum.)

(A sketch of these local-statistic scores appears below.)
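A Python sketch of the local-statistic scores just listed: transitional probability, add-λ smoothed TP, and pointwise mutual information between adjacent syllables. The toy syllable stream and λ value are illustrative.

```python
# A sketch of the local-statistic scores above: transitional probability,
# add-lambda smoothed TP, and pointwise mutual information, computed from a
# toy syllabified training stream. Stream and lambda are illustrative.
from collections import Counter
from math import log2

stream = ["la", "gi", "da", "zu", "la", "gi", "ti", "gu", "pi", "la", "gi"]

unigrams = Counter(stream)
bigrams = Counter(zip(stream, stream[1:]))
total = len(stream)
syllables = list(unigrams)


def tp(a, b, lam=0.0):
    """P(b | a), optionally with add-lambda smoothing."""
    return (bigrams[(a, b)] + lam) / (unigrams[a] + lam * len(syllables))


def mi(a, b):
    """Pointwise mutual information between adjacent syllables a, b."""
    p_ab = bigrams[(a, b)] / (total - 1)
    return log2(p_ab / ((unigrams[a] / total) * (unigrams[b] / total))) if p_ab else float("-inf")


def score_tp(word_syllables, lam=0.0):
    """s(w) = minimum TP inside the word (low values suggest a boundary)."""
    return min(tp(a, b, lam) for a, b in zip(word_syllables, word_syllables[1:]))


print(score_tp(["la", "gi"]), score_tp(["gi", "da"], lam=0.1), mi("la", "gi"))
```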
Models used (cont.)

- Swingley (2005):
  - Builds a lexicon, including syllable sequences above some threshold for both MI and n-gram frequency.
  - s(w) = max threshold at which w appears in the lexicon.
- PARSER (Perruchet and Vinter, 1998):
  - Lexicon-based model incorporating principles of lexical competition and memory decay.
  - s(w) = P(w) as defined by the model.
- GGJ (our Bayesian model):
  - s(w) = P(w) as defined by the model.
Results: linear fit
Continuing work

Comparisons to human data and other models:
- Which words/categories are most robust?
  - Compare to Frequent Frames predictions (Mintz, 2003).
  - Compare to corpus data from children's production.

Modeling cue combination:
- Integrate morphology into the syntactic model.
- Model experimental work on cue combination in category learning.