Distributional Cues to Word Boundaries: Context Is Important


Distributional Cues to Word Boundaries: Context Is Important

Sharon Goldwater (Stanford University)
Tom Griffiths (UC Berkeley)
Mark Johnson (Microsoft Research / Brown University)

Word segmentation

- One of the first problems infants must solve when learning language.
- Infants make use of many different cues: phonotactics, allophonic variation, metrical (stress) patterns, effects of coarticulation, and statistical regularities in syllable sequences.
- Statistics may provide initial bootstrapping:
  - Used very early (Thiessen & Saffran, 2003).
  - Language-independent.

Distributional segmentation

- Work on distributional segmentation often discusses transitional probabilities (Saffran et al., 1996; Aslin et al., 1998; Johnson & Jusczyk, 2001).
- What do TPs have to say about words?
  1. A word is a unit whose beginning predicts its end, but it does not predict other words.
  Or...
  2. A word is a unit whose beginning predicts its end, and it also predicts future words.

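For concreteness (this definition is standard but not spelled out on the slide), the transitional probability between adjacent syllables x and y is the conditional probability of y given x:

  TP(x → y) = P(y | x) = count(xy) / count(x)
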
Interpretation of TPs

- Most previous work assumes words are statistically independent.
  - Experimental work: Saffran et al. (1996), and many others. Stimuli (four nonsense words and a continuous familiarization stream):

    tupiro  golabu  bidaku  padoti

    golabubidakugolabutupiropadotibidakupadotitupirobidakugolabutupiropadotibidakutupiro...

  - Computational work: Brent (1999).
- What about words predicting other words?

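As an illustration of the independence-based view (a sketch, not from the talk): computing TPs over the stream excerpt above and positing a boundary wherever the TP to the next syllable is low recovers the four nonsense words, since within-word TPs here are 1.0 while between-word TPs are at most about 0.67. The two-character syllabification and the 0.8 cutoff are simplifying assumptions; experimental proposals typically appeal to TP dips rather than a fixed threshold.

```python
# Illustrative sketch: transitional-probability segmentation of a syllable
# stream in the style of Saffran et al. (1996). Syllables are assumed known
# (two characters each here), which is a simplification.

from collections import Counter

def transitional_probs(syllables):
    """TP(x -> y) = count(xy) / count(x), over adjacent syllable pairs."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(x, y): c / first_counts[x] for (x, y), c in pair_counts.items()}

def segment_by_tp(syllables, threshold=0.8):
    """Posit a word boundary wherever the TP to the next syllable is low."""
    tps = transitional_probs(syllables)
    words, current = [], [syllables[0]]
    for x, y in zip(syllables, syllables[1:]):
        if tps[(x, y)] < threshold:        # low TP = likely word boundary
            words.append("".join(current))
            current = []
        current.append(y)
    words.append("".join(current))
    return words

stream = ("golabubidakugolabutupiropadotibidaku"
          "padotitupirobidakugolabutupiropadotibidakutupiro")
syllables = [stream[i:i + 2] for i in range(0, len(stream), 2)]
print(segment_by_tp(syllables))   # recovers golabu, bidaku, tupiro, padoti tokens
```
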
Questions

- If a learner assumes that words are independent units, what is learned (from more realistic input)?
- What if the learner assumes that words are units that help predict other units?
- Approach: use a Bayesian "ideal observer" model to examine the consequences of making these different assumptions. What kinds of words are learned?

Two kinds of models

- Unigram model: words are independent.
  - Generate a sentence by generating each word independently.

[Figure: generating "look at that" under the unigram model; each word is drawn from the same distribution over the lexicon (e.g. look = .1, that = .2, at = .4) at every position.]

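A minimal generative sketch (illustrative, not the authors' code) of the unigram assumption: every word token is drawn from one fixed distribution, regardless of context. The toy probabilities echo the figure; random.choices renormalizes them, standing in for the rest of the lexicon.

```python
# Illustrative unigram generation: the same word distribution is used at every
# position in the sentence. Toy probabilities; weights need not sum to 1
# because random.choices renormalizes them.

import random

UNIGRAM = {"look": 0.1, "that": 0.2, "at": 0.4}

def generate_sentence_unigram(length=3):
    words, weights = zip(*UNIGRAM.items())
    return [random.choices(words, weights=weights)[0] for _ in range(length)]

print(generate_sentence_unigram())   # e.g. ['at', 'look', 'at']
```
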
Two kinds of models

- Bigram model: words predict other words.
  - Generate a sentence by generating each word, conditioned on the previous word.

[Figure: generating "look at that" under the bigram model; each word is drawn from a distribution conditioned on the previous word, so the probabilities over look/that/at differ from position to position.]

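A matching sketch (again illustrative) of the bigram assumption: the distribution over the next word depends on the previous word. The conditional probabilities are toy values in the spirit of the figure (the transcript does not preserve which numbers go with which context), and "<s>" is an assumed sentence-start symbol.

```python
# Illustrative bigram generation: the next word is sampled from a distribution
# conditioned on the previous word. All probabilities are toy values.

import random

BIGRAM = {
    "<s>":  {"look": 0.4, "that": 0.2, "at": 0.1},
    "look": {"look": 0.1, "that": 0.3, "at": 0.5},
    "at":   {"look": 0.1, "that": 0.5, "at": 0.1},
    "that": {"look": 0.2, "that": 0.1, "at": 0.2},
}

def generate_sentence_bigram(length=3):
    sentence, prev = [], "<s>"
    for _ in range(length):
        words, weights = zip(*BIGRAM[prev].items())
        prev = random.choices(words, weights=weights)[0]
        sentence.append(prev)
    return sentence

print(generate_sentence_bigram())    # e.g. ['look', 'at', 'that']
```
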
Bayesian learning

- The Bayesian learner seeks to identify an explanatory linguistic hypothesis that
  - accounts for the observed data.
  - conforms to prior expectations.
- Focus is on the goal of computation, not the procedure (algorithm) used to achieve the goal.

Bayesian segmentation

- In the domain of segmentation, we have:
  - Data: unsegmented corpus (transcriptions).
  - Hypotheses: sequences of word tokens.
  - Likelihood: = 1 if concatenating the words forms the corpus, = 0 otherwise.
  - Prior: encodes the unigram or bigram assumption (also others).
- The optimal solution is the segmentation with the highest prior probability, among those consistent with the corpus.

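A minimal sketch (not the authors' code) of this setup: because the likelihood is all-or-nothing, comparing hypotheses reduces to comparing prior probabilities of segmentations that reproduce the corpus. The stand-in prior here just multiplies per-word probabilities; the real priors are the unigram and bigram models described on the following slides.

```python
# Sketch of the Bayesian segmentation setup: posterior ∝ likelihood × prior,
# where the likelihood is 1 iff concatenating the hypothesized words yields
# the observed corpus. The prior below is a simple stand-in.

import math

def likelihood(segmentation, corpus):
    """1 if the hypothesized words concatenate to the corpus, else 0."""
    return 1.0 if "".join(segmentation) == corpus else 0.0

def log_prior(segmentation, word_prob):
    """Stand-in prior: product of per-word probabilities (unigram-style)."""
    return sum(math.log(word_prob(w)) for w in segmentation)

def log_posterior(segmentation, corpus, word_prob):
    if likelihood(segmentation, corpus) == 0.0:
        return float("-inf")                     # inconsistent hypotheses are ruled out
    return log_prior(segmentation, word_prob)    # posterior ∝ prior when likelihood = 1
```
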
Brent (1999)

- Describes a Bayesian unigram model for segmentation.
  - Prior favors solutions with fewer words, shorter words.
- Problems with Brent's system:
  - Learning algorithm is approximate (non-optimal).
  - Difficult to extend to incorporate bigram info.

A new unigram model

Assumes word wi is generated as follows:

1. Is wi a novel lexical item?

   P(yes) = α / (n + α)        P(no) = n / (n + α)

   Fewer word types = higher probability

A new unigram model

Assume word wi is generated as follows:

2. If novel, generate phonemic form x1...xm:

   P(wi = x1...xm) = ∏_{j=1..m} P(xj)

   Shorter words = higher probability

   If not, choose lexical identity of wi from previously occurring words:

   P(wi = l) = count(l) / n

   Power law = higher probability

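A generative sketch of this process (illustrative, not the authors' implementation): each token either reuses an existing lexical item in proportion to its count or, with probability α / (n + α), creates a novel one whose phonemic form is built phoneme by phoneme. The phoneme inventory, the value of α, and the geometric stop probability standing in for the word-length term are all toy choices.

```python
# Sketch of the unigram generative model: a rich-get-richer choice between
# reusing a previous word (overall probability count(l) / (n + ALPHA)) and
# creating a novel word (probability ALPHA / (n + ALPHA)). Toy parameters.

import random
from collections import Counter

ALPHA = 1.0                  # novelty parameter (α above)
PHONEMES = list("abdgiktu")  # toy phoneme inventory
P_STOP = 0.5                 # toy stop probability, standing in for the length term

def generate_word(counts, n):
    """Generate one word token given counts of previously generated words (n tokens so far)."""
    if random.random() < ALPHA / (n + ALPHA):        # novel lexical item?
        word = random.choice(PHONEMES)
        while random.random() > P_STOP:              # shorter words are more probable
            word += random.choice(PHONEMES)
        return word
    words, weights = zip(*counts.items())            # else reuse with probability count(l) / n
    return random.choices(words, weights=weights)[0]

def generate_corpus(num_tokens):
    counts, corpus = Counter(), []
    for n in range(num_tokens):
        w = generate_word(counts, n)
        counts[w] += 1
        corpus.append(w)
    return corpus

print(generate_corpus(20))
```
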
Advantages of our model

          Unigram?   Bigram?   Algorithm?
  Brent   ✓          ✗         ✗
  GGJ     ✓          ✓         ✓

Unigram model: simulations

- Same corpus as Brent:
  - 9790 utterances of phonemically transcribed child-directed speech (19-23 months).
  - Average utterance length: 3.4 words.
  - Average word length: 2.9 phonemes.
- Example input:

  yuwanttusiD6bUk
  lUkD*z6b7wIThIzh&t
  &nd6dOgi
  yuwanttulUk&tDIs
  ...

Example results

Comparison to previous results

- Proposed boundaries are more accurate than Brent's, but fewer proposals are made:

          Boundary Precision   Boundary Recall
  Brent   .80                  .85
  GGJ     .92                  .62

  Precision: #correct / #found
  Recall:    #correct / #true

- Result: word tokens are less accurate:

          Token F-score
  Brent   .68
  GGJ     .54

  F-score: the harmonic mean of precision and recall.

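A small sketch (illustrative; not the authors' evaluation code) of how these boundary scores are computed, using the corpus utterance "yuwanttusiD6bUk" with an assumed gold segmentation and a hypothetical undersegmented proposal. As in the table, proposing fewer boundaries yields high precision but low recall.

```python
# Boundary precision / recall / F-score, computed over boundary positions
# between phonemes. The gold and proposed segmentations below are assumed
# examples for illustration.

def boundary_positions(words):
    """Utterance-internal boundary positions implied by a segmentation."""
    positions, i = set(), 0
    for w in words[:-1]:            # the utterance-final boundary is not scored
        i += len(w)
        positions.add(i)
    return positions

def precision_recall_f(proposed, true):
    correct = len(proposed & true)
    precision = correct / len(proposed) if proposed else 0.0   # #correct / #found
    recall = correct / len(true) if true else 0.0              # #correct / #true
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

gold = boundary_positions(["yu", "want", "tu", "si", "D6", "bUk"])
prop = boundary_positions(["yuwant", "tusi", "D6bUk"])      # undersegmented proposal
print(precision_recall_f(prop, gold))   # (1.0, 0.4, ~0.57): precise but low recall
```
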
What happened?

- Model assumes (falsely) that words have the same probability regardless of context:

  P(D&t) = .024
  P(D&t | WAts) = .46
  P(D&t | tu) = .0019

- Positing amalgams allows the model to capture word-to-word dependencies.

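A toy calculation (all numbers other than P(D&t) are invented for illustration) of why amalgams help a unigram model: generating "WAts" and "D&t" as independent tokens costs the product of two small probabilities, whereas a dedicated lexical entry for the amalgam "WAtsD&t", whose probability tracks how often the pair actually occurs, can be far more probable.

```python
# Toy illustration of why a unigram model prefers amalgams. Only p_dat comes
# from the slide; the other values are invented for the example.

p_wats = 0.01       # hypothetical unigram probability of "WAts"
p_dat = 0.024       # unigram probability of "D&t" (from the slide)
p_amalgam = 0.008   # hypothetical unigram probability of the amalgam "WAtsD&t"

p_two_tokens = p_wats * p_dat           # cost under the independence assumption
print(p_two_tokens, p_amalgam)          # 0.00024 vs. 0.008: the amalgam wins
```
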
What about other unigram models?

- Brent's learning algorithm is insufficient to identify the optimal segmentation.
  - Our solution has higher probability under his model than his own solution does.
  - On a randomly permuted corpus, our system achieves 96% accuracy; Brent gets 81%.
- Formal analysis shows undersegmentation is the optimal solution for any (reasonable) unigram model.

Bigram model

Assume word wi is generated as follows:

1. Is (wi-1, wi) a novel bigram?

   P(yes) = β / (n_{wi-1} + β)        P(no) = n_{wi-1} / (n_{wi-1} + β)

   where n_{wi-1} is the number of words generated so far following wi-1.

2. If novel, generate wi using the unigram model.
   If not, choose lexical identity of wi from words previously occurring after wi-1:

   P(wi = l | wi-1 = l') = count(l', l) / count(l')

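A generative sketch of this bigram process (illustrative, not the authors' implementation): given the previous word, the next token either starts a novel bigram, backing off to a unigram generator, or reuses a word seen before in that context in proportion to its count. BETA, the start symbol, and the fallback generator passed in are toy choices.

```python
# Sketch of the bigram generative model: novel bigrams back off to a unigram
# generator; otherwise the next word is reused with probability
# count(prev, l) / count(prev). Toy parameters throughout.

import random
from collections import Counter, defaultdict

BETA = 1.0   # novel-bigram parameter (β above)

def generate_next(prev, bigram_counts, unigram_generator):
    """Generate w_i given w_{i-1} = prev, updating the bigram counts in place."""
    context = bigram_counts[prev]                     # Counter of words seen after prev
    n_prev = sum(context.values())
    if random.random() < BETA / (n_prev + BETA):      # novel bigram?
        w = unigram_generator()                       # back off to the unigram model
    else:
        words, weights = zip(*context.items())        # reuse: count(prev, l) / count(prev)
        w = random.choices(words, weights=weights)[0]
    context[w] += 1
    return w

# Usage sketch with a trivial fallback generator:
counts = defaultdict(Counter)
prev = "<s>"                                          # assumed sentence-start symbol
for _ in range(10):
    prev = generate_next(prev, counts, lambda: random.choice(["look", "at", "that"]))
print(dict(counts))
```
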
Example results

Quantitative evaluation

- Compared to the unigram model, more boundaries are proposed, with no loss in accuracy:

                  Boundary Precision   Boundary Recall
  GGJ (unigram)   .92                  .62
  GGJ (bigram)    .92                  .84

- Accuracy is higher than previous models:

                   Token F-score   Type F-score
  Brent (unigram)  .68             .52
  GGJ (bigram)     .77             .63

Conclusion

- Different assumptions about what defines a word lead to different segmentations.
  - Beginning of word predicts end of word: optimal solution undersegments, finding common multi-word units.
  - Word also predicts next word: segmentation is more accurate, adult-like.
- Important to consider how transitional probabilities and other statistics are used.

Constraints on learning

- Algorithms can impose implicit constraints.
  - Implication: the learning process prevents the learner from identifying the best solutions.
  - Specifics of the algorithm are critical, but their effect is hard to determine.
- The prior imposes explicit constraints.
  - State general expectations about the nature of language.
  - Assume humans are good at learning.

Algorithmic constraints

- Venkataraman (2001) and Batchelder (2002) describe unigram model-based approaches to segmentation, with no prior.
  - Venkataraman's algorithm penalizes novel words.
  - Batchelder's algorithm penalizes long words.
- Without algorithmic constraints, these models would memorize every utterance whole (insert no word boundaries).

Remaining questions

- Are multi-word chunks sufficient as an initial bootstrapping step in humans? (cf. Swingley, 2005)
- Do children go through a stage with many chunks like these? (cf. MacWhinney, ??)
- Are humans able to segment based on bigram statistics?