Distributional Cues to Word Boundaries: Context Is Important
Sharon Goldwater (Stanford University)
Tom Griffiths (UC Berkeley)
Mark Johnson (Microsoft Research / Brown University)
Word segmentation
One of the first problems infants must solve when learning language.
Infants make use of many different cues: phonotactics, allophonic variation, metrical (stress) patterns, effects of coarticulation, and statistical regularities in syllable sequences.
Statistics may provide initial bootstrapping:
used very early (Thiessen & Saffran, 2003);
language-independent.
Distributional segmentation
Work on distributional segmentation often discusses transitional probabilities (Saffran et al., 1996; Aslin et al., 1998; Johnson & Jusczyk, 2001).
What do TPs have to say about words?
1. A word is a unit whose beginning predicts its end, but it does not predict other words.
Or...
2. A word is a unit whose beginning predicts its end, and it also predicts future words.
Interpretation of TPs
Most previous work assumes words are statistically independent.
Experimental work: Saffran et al. (1996), many others.
  tupiro  golabu  bidaku  padoti
  golabubidakugolabutupiropadotibidakupadotitupirobidakugolabutupiropadotibidakutupiro…
Computational work: Brent (1999).
What about words predicting other words?
Questions
If a learner assumes that words are independent units, what is learned (from more realistic input)?
What if the learner assumes that words are units that help predict other units?
Approach: use a Bayesian "ideal observer" model to examine the consequences of making these different assumptions. What kinds of words are learned?
Two kinds of models
Unigram model: words are independent.
Generate a sentence by generating each word independently.
[Figure: each word is drawn from the same probability distribution over the vocabulary (look, that, at, ...), regardless of what came before; the example generates "look at that". A code sketch follows.]
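As a rough illustration of the unigram assumption (not the authors' code), the Python sketch below generates an utterance by drawing each word independently from one fixed distribution; the toy lexicon and probabilities are invented for illustration.

import random

# Toy unigram model: every word is drawn from the same distribution,
# regardless of what came before. Lexicon and probabilities are invented.
unigram_probs = {"look": 0.1, "that": 0.2, "at": 0.4, "the": 0.3}

def generate_unigram_utterance(length):
    words = list(unigram_probs)
    weights = [unigram_probs[w] for w in words]
    # Each word is an independent draw: P(w1...wn) = P(w1) * ... * P(wn)
    return [random.choices(words, weights)[0] for _ in range(length)]

print(generate_unigram_utterance(3))  # e.g. ['look', 'at', 'that']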
Two kinds of models
Bigram model: words predict other words.
Generate a sentence by generating each word, conditioned on the previous word.
[Figure: each word is drawn from a distribution conditioned on the previous word, so the probabilities over the vocabulary (look, that, at, ...) change at every step; the example generates "look at that". A code sketch follows.]
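For contrast, here is the corresponding bigram sketch (again not the authors' code): each draw is conditioned on the previous word, and the conditional probabilities are invented for illustration.

import random

# Toy bigram model: each word is drawn from a distribution conditioned on
# the previous word. Probabilities are invented; "<s>" marks utterance start.
bigram_probs = {
    "<s>":  {"look": 0.5, "that": 0.3, "at": 0.2},
    "look": {"at": 0.6, "that": 0.3, "look": 0.1},
    "at":   {"that": 0.7, "look": 0.2, "at": 0.1},
    "that": {"look": 0.4, "at": 0.3, "that": 0.3},
}

def generate_bigram_utterance(length):
    utterance, prev = [], "<s>"
    for _ in range(length):
        dist = bigram_probs[prev]
        words, weights = zip(*dist.items())
        word = random.choices(words, weights)[0]
        utterance.append(word)
        prev = word  # the next word is conditioned on this one
    return utterance

print(generate_bigram_utterance(3))  # e.g. ['look', 'at', 'that']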
Bayesian learning
The Bayesian learner seeks to identify an explanatory linguistic hypothesis that
accounts for the observed data, and
conforms to prior expectations.
Focus is on the goal of computation, not the procedure (algorithm) used to achieve the goal.
Bayesian segmentation
In the domain of segmentation, we have:
Data d: unsegmented corpus (transcriptions).
Hypotheses h: sequences of word tokens.
Likelihood P(d | h) = 1 if concatenating the words of h forms the corpus, = 0 otherwise.
Prior P(h) encodes the unigram or bigram assumption (also others).
The optimal solution is the segmentation consistent with the corpus that has the highest prior probability. (A sketch of this setup follows.)
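To make the setup concrete, the sketch below (not the authors' model) enumerates every segmentation of one short unsegmented string, so the 0/1 likelihood is 1 for each candidate by construction, and scores the candidates under a stand-in prior; the scoring function is a placeholder, not the prior used in the paper.

from math import prod

def segmentations(s):
    """Yield every way of splitting string s into contiguous words."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        for rest in segmentations(s[i:]):
            yield [s[:i]] + rest

def toy_prior(words, phoneme_prob=1.0 / 50, stop_prob=0.5):
    # Placeholder prior (not the paper's): each word pays a per-phoneme cost
    # plus a word-ending cost, favoring analyses with fewer, shorter words.
    return prod((phoneme_prob ** len(w)) * stop_prob for w in words)

corpus = "lUk&tDIs"   # one unsegmented utterance from the example input
# Every candidate concatenates back to the corpus, so the 0/1 likelihood is 1
# for all of them; the "optimal" hypothesis is simply the one with the highest
# prior. Note that this simplistic prior, having no reusable lexicon, ends up
# preferring the single-word analysis; richer priors over a shared lexicon
# (as in the models below) behave differently.
best = max(segmentations(corpus), key=toy_prior)
print(best)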
Brent (1999)
Describes a Bayesian unigram model for segmentation.
Prior favors solutions with fewer words, shorter words.
Problems with Brent's system:
Learning algorithm is approximate (non-optimal).
Difficult to extend to incorporate bigram info.
A new unigram model
Assume word wi is generated as follows:
1. Is wi a novel lexical item?
   P(yes) = α / (n + α)      P(no) = n / (n + α)
   (n = number of words generated so far; α = model parameter)
   Fewer word types = higher probability.
A new unigram model
Assume word wi is generated as follows (continued):
2. If novel, generate phonemic form x1...xm:
   P(wi = x1...xm) = P(x1) * ... * P(xm)
   Shorter words = higher probability.
   If not novel, choose the lexical identity of wi from previously occurring words:
   P(wi = l) = count(l) / n
   Power-law word frequencies = higher probability.
(A code sketch of this generative process follows.)
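A rough Python sketch of this generative step, in the spirit of the slide (a Chinese-restaurant-style cache model, not the authors' exact code); the concentration parameter ALPHA, the phoneme inventory, and the stop probability are placeholders.

import random

ALPHA = 1.0                         # assumed concentration parameter
PHONEMES = list("abdgiklmnoptuw")   # placeholder phoneme inventory
STOP = 0.5                          # assumed prob. of ending a novel word after each phoneme

def generate_word(lexicon_counts):
    """Generate the next word token given counts of previously generated words."""
    n = sum(lexicon_counts.values())
    # Step 1: novel lexical item with probability ALPHA / (n + ALPHA)?
    if random.random() < ALPHA / (n + ALPHA):
        # Step 2a: novel word, spelled out phoneme by phoneme (shorter = more probable)
        word = random.choice(PHONEMES)
        while random.random() > STOP:
            word += random.choice(PHONEMES)
    else:
        # Step 2b: reuse a previous word with probability count(l) / n
        # (rich-get-richer reuse, which yields power-law word frequencies)
        words = list(lexicon_counts)
        word = random.choices(words, [lexicon_counts[w] for w in words])[0]
    lexicon_counts[word] = lexicon_counts.get(word, 0) + 1
    return word

counts = {}
print([generate_word(counts) for _ in range(10)])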
Advantages of our model
         Unigram?   Bigram?   Algorithm?
Brent    yes        no        no
GGJ      yes        yes       yes
Unigram model: simulations
Same corpus as Brent: 9790 utterances of phonemically transcribed child-directed speech (19-23 months).
Average utterance length: 3.4 words.
Average word length: 2.9 phonemes.
Example input:
yuwanttusiD6bUk
lUkD*z6b7wIThIzh&t
&nd6dOgi
yuwanttulUk&tDIs
...
Example results
Comparison to previous results
Proposed boundaries are more accurate than Brent's, but fewer proposals are made.

         Boundary precision   Boundary recall
Brent    .80                  .85
GGJ      .92                  .62

Precision: #correct / #found. Recall: #correct / #true.

Result: word tokens are less accurate.

         Token F-score
Brent    .68
GGJ      .54

F-score: the harmonic mean of precision and recall. (A sketch of these metrics follows.)
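For concreteness, here is a small sketch of how boundary precision, recall, and F-score can be computed from a proposed vs. true segmentation; the two segmentation strings and the utterance-internal boundary convention are illustrative assumptions, not the paper's evaluation code.

def boundaries(segmented):
    """Return the set of utterance-internal boundary positions (in phonemes)."""
    positions, pos = set(), 0
    for word in segmented.split()[:-1]:   # the utterance-final boundary is not scored
        pos += len(word)
        positions.add(pos)
    return positions

def precision_recall_f(proposed, true):
    p_bounds, t_bounds = boundaries(proposed), boundaries(true)
    correct = len(p_bounds & t_bounds)
    precision = correct / len(p_bounds)                 # #correct / #found
    recall = correct / len(t_bounds)                    # #correct / #true
    f = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f

# Invented example in the corpus transcription ("you want to look at this")
print(precision_recall_f("yuwant tu lUk&tDIs", "yu want tu lUk &t DIs"))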
What happened?
The model assumes (falsely) that words have the same probability regardless of context:
P(D&t) = .024
P(D&t | WAts) = .46
P(D&t | tu) = .0019
Positing amalgams allows the model to capture word-to-word dependencies.
What about other unigram models?
Brent's learning algorithm is insufficient to identify the optimal segmentation.
Our solution has higher probability under his model than his own solution does.
On a randomly permuted corpus, our system achieves 96% accuracy; Brent's gets 81%.
Formal analysis shows that undersegmentation is the optimal solution for any (reasonable) unigram model.
Bigram model
Assume word wi is generated as follows:
1. Is (wi-1, wi) a novel bigram?
   P(yes) = β / (n(wi-1) + β)      P(no) = n(wi-1) / (n(wi-1) + β)
   (n(wi-1) = number of times wi-1 has occurred so far; β = model parameter)
2. If novel, generate wi using the unigram model.
   If not novel, choose the lexical identity of wi from words previously occurring after wi-1:
   P(wi = l | wi-1 = l') = count(l', l) / count(l')
(A code sketch of this generative process follows.)
Example results
Quantitative evaluation
Compared to the unigram model, more boundaries are proposed, with no loss in accuracy:

                 Boundary precision   Boundary recall
GGJ (unigram)    .92                  .62
GGJ (bigram)     .92                  .84

Accuracy is higher than previous models:

                 Token F-score   Type F-score
Brent (unigram)  .68             .52
GGJ (bigram)     .77             .63
Conclusion
Different assumptions about what defines a word lead to different segmentations.
Beginning of word predicts end of word: the optimal solution undersegments, finding common multi-word units.
Word also predicts next word: segmentation is more accurate, adult-like.
It is important to consider how transitional probabilities and other statistics are used.
Constraints on learning
Algorithms can impose implicit constraints.
Implication: the learning process prevents the learner from identifying the best solutions.
The specifics of the algorithm are critical, but their effects are hard to determine.
The prior imposes explicit constraints.
It states general expectations about the nature of language.
Assume humans are good at learning.
Algorithmic constraints
Venkataraman (2001) and Batchelder (2002) describe unigram model-based approaches to segmentation, with no prior.
Venkataraman's algorithm penalizes novel words.
Batchelder's algorithm penalizes long words.
Without algorithmic constraints, these models would memorize every utterance whole (insert no word boundaries). (A brief worked sketch of why follows.)
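A hedged sketch of why this holds, under the simplifying assumption of a pure maximum-likelihood unigram learner whose model assigns each utterance a well-formed probability (e.g. via an utterance-end symbol), rather than an analysis of those specific systems: let $n_u$ be the number of occurrences of utterance type $u$ and $U$ the total number of utterances. The corpus likelihood under any hypothesis $h$ factors over utterances, and the induced utterance probabilities $P_h(u)$ sum to at most one, so

  $P(\text{corpus} \mid h) = \prod_u P_h(u)^{n_u} \le \prod_u (n_u / U)^{n_u}$,

since $\prod_u p_u^{n_u}$ is maximized over distributions by $p_u = n_u / U$. The right-hand side is exactly the likelihood of the hypothesis that treats each utterance as a single word, so with no prior to penalize it, memorizing every utterance whole is an optimal solution.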
Remaining questions
Are multi-word chunks sufficient as an initial bootstrapping step in humans? (cf. Swingley, 2005)
Do children go through a stage with many chunks like these? (cf. MacWhinney, ??)
Are humans able to segment based on bigram statistics?