Slide Transcript

Speech and Language Processing:
Where have we been
and where are we going?
Kenneth Ward Church
[email protected]
Where have we been?
How To Cook A Demo
(After Dinner Talk at TMI-1992 & Invited Talk at TMI-2002)
• Great fun!
• Effective demos
Message for After Dinner Talk:
– Theater, theater, theater
– Production quality matters
– Entertainment >> evaluation
– Strategic vision >> technical correctness
• Success/Catastrophe
Message for After Breakfast Talk:
– Warning: demos can be too effective
– Dangerous to raise unrealistic expectations
Brno 2004
Let’s go to the video tape!
(Lesson: manage expectations)
• Lots of predictions
  – Entertaining in retrospect
  – Nevertheless, many of these people went on to very successful careers: president of MIT, Microsoft exec, etc.
1. Machine Translation (1950s) video
  – Classic example of a demo → embarrassment in retrospect
2. Translating telephone (late 1980s) video
  – Pierre Isabelle pulled a similar demo because it was so effective
  – The limitations of the technology were hard to explain to the public
    • Though well understood by the research community
3. Apple (~1990) video
  – Still having trouble setting appropriate expectations
  – Factoid: the day of this demo, speech recognition was deployed at scale in the AT&T network – with significant lasting impact – but little media attention
4. Andy Rooney (~1990): reset expectations video
Charles Wayne’s Challenge:
Demonstrate Consistent Progress Over Time
(Managing Expectations)
• Controversial in 1980s
• But not in 1990s
  – Though, grumbling
• Benefits
  1. Agreement on what to do
  2. Limits endless discussion
  3. Helps sell the field
    • Manage expectations
    • Fund raising
• Risks (similar to benefits)
  1. All our eggs are in one basket (lack of diversity)
  2. Not enough discussion
    • Hard to change course
  3. Methodology → Burden
Hockey Stick Business Case
[Chart: revenue ($) v. time (t), with modest results last year (2003) and this year (2004) and a hockey-stick projection for next year (2005)]
Moore’s Law: Ideal Answer
Where have we been and where are we going?
Borrowed Slide: Audrey Le (NIST)
[Chart: error rate v. date, 15 years of benchmark results]
Moore’s Law Time Constant:
• 10x improvement per decade
Milestones in Speech and Multimodal Technology Research
(Borrowed Slide)
• 1962: Small vocabulary, acoustic-phonetics based. Isolated words. Filter-bank analysis; time normalization; dynamic programming.
• 1967–1972: Medium vocabulary, template-based. Isolated words; connected digits; continuous speech. Pattern recognition; LPC analysis; clustering algorithms; level building.
• 1977–1982: Large vocabulary, statistical-based. Connected words; continuous speech. Hidden Markov models; stochastic language modeling.
• 1987–1992: Large vocabulary; syntax, semantics. Continuous speech; speech understanding. Stochastic language understanding; finite-state machines; statistical learning.
• 1997–2002: Very large vocabulary; semantics, multimodal dialog, TTS. Spoken dialog; multiple modalities. Concatenative synthesis; machine learning; mixed-initiative dialog.

Consistent improvement over time, but unlike Moore’s Law, hard to extrapolate (predict the future)
Speech-Related Technologies
Where will the field go in 10 years?
Niels Ole Bernsen (ed)
• 2003: Useful speech recognition-based language tutor
• 2003: Useful portable spoken sentence translation systems
• 2003: First pro-active spoken dialogue with situation awareness
• 2004: Satisfactory spoken car navigation systems
• 2005: Small-vocabulary (> 1000 words) spoken conversational systems
• 2006: Multiple-purpose personal assistants (spoken dialog, animated characters)
• 2006: Task-oriented spoken translation systems for the web
• 2006: Useful speech summarization systems in top languages
• 2008: Useful meeting summarization systems
• 2010: Medium-size vocabulary conversational systems
Where have we been and where are we going?
• Manage Expectations: extrapolation/prediction is Not Applicable
• Consistent Progress over Time: extrapolation/prediction is Applicable
[Chart: the hockey-stick business case again; $ v. t, 2002–2004]
Outline
1. We’re making consistent progress, or
2. We’re running around in circles, or
3. We’re going off a cliff…
We are here
It has been claimed that recent progress was made possible by Empiricism.
Progress (or Oscillating Fads)?
• 1950s: Empiricism was at its peak
  – Dominating a broad set of fields
    • Ranging from psychology (Behaviorism)
    • To electrical engineering (Information Theory)
  – Psycholinguistics: word frequency norms (correlated with reaction time, errors)
    • Word association norms (priming): bread and butter, doctor/nurse
  – Linguistics/psycholinguistics: focus on distribution (correlate of meaning)
    • Firth: “You shall know a word by the company it keeps”
    • Collocations: strong tea v. powerful computers
• 1970s: Rationalism was at its peak
  – with Chomsky’s criticism of ngrams in Syntactic Structures (1957)
  – and Minsky and Papert’s criticism of neural networks in Perceptrons (1969)
• 1990s: Revival of Empiricism
  – Availability of massive amounts of data (popular argument, even before the web)
    • “More data is better data”
    • Quantity >> Quality (balance)
  – Pragmatic focus:
    • What can we do with all this data?
    • Better to do something than nothing at all
  – Empirical methods (and focus on evaluation): Speech → Language
• 2010s: Revival of Rationalism (?)
[Callouts: Periodic signals are continuous and support extrapolation/prediction. Progress? Consistent progress? Extrapolation/Prediction: applicable?]
Speech → Language
Has the pendulum swung too far?
• What happened between TMI-1992 and TMI-2002 (if anything)?
• Have empirical methods become too popular? [Plays well at Machine Translation conferences]
  – Has too much happened since TMI-1992?
• I worry that the pendulum has swung so far that
  – We are no longer training students for the possibility
    • that the pendulum might swing the other way
• We ought to be preparing students with a broad education including:
  – Statistics and Machine Learning
  – as well as Linguistic Theory
• History repeats itself (Mark Twain); a bad idea then is still a bad idea now:
  – 1950s: empiricism
  – 1970s: rationalism (empiricist methodology became too burdensome)
  – 1990s: empiricism
  – 2010s: rationalism (empiricist methodology is burdensome, again)
[Callout: Grandparents and grandchildren have a natural alliance]
Rationalism v. Empiricism

                            Rationalism                     Empiricism
Well-known advocates        Chomsky, Minsky                 Shannon, Skinner, Firth, Harris
Model                       Competence Model                Noisy Channel Model
Contexts of Interest        Phrase-Structure                N-Grams
Goals                       All and Only                    Minimize Prediction Error (Entropy)
                            Explanatory                     Descriptive
                            Theoretical                     Applied
Linguistic Generalizations  Agreement & Wh-movement         Collocations & Word Associations
Parsing Strategies          Principle-Based, CKY (Chart),   Forward-Backward (HMMs),
                            ATNs, Unification               Inside-Outside (PCFGs)
Applications                Understanding:                  Recognition:
                            Who did what to whom            Noisy Channel Applications
Revival of Empiricism:
A Personal Perspective
• As a student at MIT, I was solidly opposed to empiricism
  – But that changed soon after moving to AT&T Bell Labs (1983)
• Letter-to-Sound Rules (speech synthesis) [Letter-to-sound rules → Dict]
  – Names (~1985): Letter stats → Etymology → Pronunciation video
• Part of Speech Tagging (1988)
• Word Associations (Hanks) [Lexicography]
  – Corpus-based lexicography: empirical, but not statistical [Case-based reasoning: the best inference is table lookup]
    • Collocations: strong tea v. powerful computers
    • Word Associations: bread and butter, doctor/nurse
  – Contribution: adding stats [Statistics]
    • Mutual info → collocations & word associations
    • Pr(doctor…nurse) >> Pr(doctor) Pr(nurse)
• Good-Turing Smoothing (Gale):
  – Estimate probability of something you haven’t seen (whales)
• Aligning Parallel Corpora: inspired by Machine Translation (MT)
• Word Sense Disambiguation (river bank v. money bank)
  – Bilingual → Monolingual (Yarowsky)
• Even if IBM’s stat-based approach fails for Machine Translation → lasting benefit (tools, linguistic resources, academic contributions to machine learning) [Played well at TMI-2002]
Speech → Language
Shannon’s Noisy Channel Model
• I → Noisy Channel → O
• I′ ≈ ARGMAX_I Pr(I|O) = ARGMAX_I Pr(I) Pr(O|I)
  – Pr(I): language model (application independent)
  – Pr(O|I): channel model

Trigram Language Model (application independent):

  Word       Rank  More likely alternatives
  We         9     The This One Two A Three Please In
  need       7     are will the would also do
  to         1
  resolve    85    have know do…
  all        9
  of         2
  the        1
  important  657   document question first…
  issues     14    thing point to…

Channel Model:

  Application                          Input       Output
  Speech Recognition                   writer      rider
  OCR (Optical Character Recognition)  all         a1l
  Spelling Correction                  government  goverment
Speech → Language:
Using (Abusing) Shannon’s Noisy Channel Model
• Speech
  – Words → Noisy Channel → Acoustics
• OCR
  – Words → Noisy Channel → Optics
• Spelling Correction
  – Words → Noisy Channel → Typos
• Part of Speech Tagging (POS):
  – POS → Noisy Channel → Words
• Machine Translation: “Made in America”
  – English → Noisy Channel → French
Recent work
The Chance of Two Noriegas is Closer to p/2 than p²:
Implications for Language Modeling, Information Retrieval and Gzip
• Standard independence models (Binomial, Multinomial, Poisson):
  – Chance of 1st Noriega is p
  – Chance of 2nd is also p
• Repetition is very common
  – Ngrams/words (and their variant forms) appear in bursts
  – Noriega appears several times in a doc, or not at all
• Adaptation & contagious probability distributions
• Discourse structure (e.g., text cohesion, given/new):
  – 1st Noriega in a document is marked (more surprising)
  – 2nd is unmarked (less surprising)
• Empirically, we find the first Noriega is surprising (p ≈ 6/1000)
  – But the chance of two is not surprising (closer to p/2 than p²)
• Finding a rare word like Noriega is like lightning
  – We might not expect lightning to strike twice in a doc
  – But it happens all the time, especially for good keywords
• Documents ≠ Random Bags of Words
Three Applications & Independence Assumptions:
No Quantity Discounts
• Compression: Huffman Coding
  – |encoding(s)| = ceil(−log2 Pr(s))
  – Two Noriegas consume twice as much space as one
    • |encoding(s s)| = |encoding(s)| + |encoding(s)|
  – No quantity discount
    • Independence is the worst case: any dependencies → less H (space)
• Information Retrieval
  – Score(query, doc) = Σ_{term in doc} tf(term, doc) · idf(term)
    • idf(term): inverse doc freq: −log2 Pr(term) = −log2 df(term)/D
    • tf(term, doc): number of instances of term in doc [log tf smoothing]
  – Two Noriegas are twice as surprising as one (2·idf v. idf)
  – No quantity discount: any dependencies → less surprise
• Speech Recognition, OCR, Spelling Correction
  – I → Noisy Channel → O
  – Pr(I) Pr(O|I)
  – Pr(I) = Pr(w1, w2 … wn) ≈ ∏_k Pr(wk | wk−2, wk−1)
Interestingness Metrics:
Deviations from Independence
• Poisson (and other indep assumptions)
– Not bad for meaningless random strings
• Deviations from Poisson are clues for
hidden variables
– Meaning, content, genre, topic, author, etc.
• Analogous to mutual information (Hanks)
– Pr(doctor…nurse) >> Pr(doctor) Pr(nurse)
Poisson Mixtures: More Poissons → Better Fit
(Interpretation: each Poisson is conditional on hidden variables: meaning, content, genre, topic, author, etc.)
Adaptation: Three Approaches
1. Cache-based adaptation
   Pr(w|…) = λ · Pr_local(w|…) + (1−λ) · Pr_global(w|…)
2. Parametric models
   – Poisson, Two Poisson, Mixtures (negative binomial)
   – Pr(k≥2 | k≥1) = (1 − Pr(1) − Pr(0)) / (1 − Pr(0))
3. Non-parametric
   – Pr(+adapt1) ≡ Pr(test|hist)
   – Pr(+adapt2) ≡ Pr(k≥2 | k≥1)
Positive & Negative Adaptation
• Adaptation:
– How do probabilities change as we read a doc?
• Intuition: If a word w has been seen recently
1. +adapt: prob of w (and its friends) goes way up
2. −adapt: prob of many other words goes down a little
• Pr(+adapt) >> Pr(prior) > Pr(−adapt)
Adaptation: Method 1
• Split each document into two equal pieces:
  – Hist: 1st half of doc
  – Test: 2nd half of doc
• Task: given hist, predict test
• Compute a contingency table for each word

Documents containing hostages in 1990 AP News:

           test    ¬test
  hist     638     505
  ¬hist    557     76,787
Adaptation: Method 1
• Notation (contingency table for each word w):

           test    ¬test
  hist     a       b
  ¬hist    c       d

  – D = a+b+c+d (library)
  – df = a+b+c (doc freq)
• Prior: Pr(w ∈ test) = (a+c)/D
• +adapt: Pr(w ∈ test | w ∈ hist) = a/(a+b)
• −adapt: Pr(w ∈ test | w ∉ hist) = c/(c+d)

Pr(+adapt) >> Pr(prior) > Pr(−adapt)

Documents containing hostages:

  +adapt  prior  −adapt  source
  0.56    0.014  0.0069  AP 1987
  0.56    0.015  0.0072  AP 1990
  0.59    0.013  0.0057  AP 1991
  0.39    0.004  0.0030  AP 1993
Priming, Neighborhoods and Query Expansion
• Priming: doctor/nurse
  – Doctor in hist → Pr(Nurse in test) ↑
• Find docs near hist (IR sense)
  – Neighborhood ≡ set of words in docs near hist (query expansion)
• Partition vocabulary into three sets:
  1. Hist: word in hist
  2. Near: word in neighborhood − hist
  3. Other: none of the above

           test    ¬test
  hist     a       b
  near     e       f
  other    g       h

• Prior: Pr(w ∈ test) = (a+e+g)/D
• +adapt: Pr(w ∈ test | w ∈ hist) = a/(a+b)
• Near: Pr(w ∈ test | w ∈ near) = e/(e+f)
• Other: Pr(w ∈ test | w ∈ other) = g/(g+h)
Adaptation: Hist >> Near >> Prior
• Magnitude is huge
  – p/2 >> p²
  – Two Noriegas are not much more surprising than one
  – Huge quantity discounts
• Shape: given/new
  – 1st mention: marked
    • Surprising (low prob)
    • Depends on freq
  – 2nd: unmarked
    • Less surprising
    • Independent of freq
  – Priming:
    • “a little bit” marked
Adaptation is Lexical
• Lexical: adaptation is
  – Stronger for good keywords (Kennedy)
  – Than random strings, function words (except), etc.
• Content ≠ low frequency

  +adapt  prior  −adapt  source  word
  0.27    0.012  0.0091  AP90    Kennedy
  0.40    0.015  0.0084  AP91    Kennedy
  0.32    0.014  0.0094  AP93    Kennedy
  0.049   0.016  0.016   AP90    except
  0.048   0.014  0.014   AP91    except
  0.048   0.012  0.012   AP93    except
Adaptation: Method 2
• Pr(+adapt2) = Pr(k≥2 | k≥1) = df2 / df1
• dfk(w) ≡ number of documents that mention word w at least k times
• df1(w) ≡ standard definition of document freq (df)
Pr(+adapt1) ≈ Pr(+adapt2)
Within factors of 2-3 (as opposed to 10-1000)
[Chart annotations: 3rd mention; priming]
Adaptation helps more than it hurts
• Examples of big winners (boilerplate): hist is a great clue
  – Lists of major cities and their temperatures
  – Lists of major currencies and their prices
  – Lists of commodities and their prices
  – Lists of senators and how they voted
• Examples of big losers: hist is misleading
  – Summary articles
  – Articles that were garbled in transmission
Recent Work (with Kyoji Umemura)
• Applications: Japanese morphology (text → words)
  – Standard methods: dictionary-based
  – Challenge: OOV (out of vocabulary)
  – Good keywords (OOV) adapt more than meaningless fragments
    • Poisson model: not bad for meaningless random strings
    • Adaptation (deviations from Poisson): great clues for hidden variables
      – OOV, good keywords, technical terminology, meaning, content, genre, author, etc.
  – Extend the dictionary method to also look for substrings that adapt a lot
• Practical procedure for counting dfk(s) for all substrings s in a large corpus (trigrams → million-grams)
  – Suffix array: standard method for computing freq and loc for all s
  – Yamamoto & Church (2001): count df for all ngrams in a large corpus
    • df (and many other ngram stats) for million-grams
    • Although there are too many ngrams to work with (O(n²))
      – They can be grouped into a manageable number of equivalence classes (O(n))
      – Where all substrings in a class share the same stats
  – Umemura (submitted): generalize the method to dfk
    • Adaptation for million-grams
The Solution (dfk for all ngrams)

/* Fragment: assumes the usual suffix-array globals from Yamamoto &
   Church (2001): suffix array with neighbors[], LCP[] (longest common
   prefix), stack pointer sp, output file out, and fatal(). */
struct stackframe { int start, SIL, cdfk; } *stack;

int kth_neighbor(int suffix, int k)
{
    int i, result = suffix;
    for (i = 0; i < k && result >= 0; i++)
        result = neighbors[result];
    return result;
}

int find(int suffix)
{
    int low = 0;
    int high = sp;
    while (low + 1 < high) {
        int mid = (low + high) / 2;
        if (stack[mid].start <= suffix) low = mid;
        else high = mid;
    }
    if (stack[high].start <= suffix) return high;
    if (stack[low].start <= suffix) return low;
    fatal("can't get here");
}

/* main loop: sweep the suffix array, reporting dfk for all ngrams */
for (w = 0; w < N; w++) {
    if (LCP[w] > stack[sp].SIL) {
        sp++;
        stack[sp].start = w;
        stack[sp].SIL = LCP[w];
        stack[sp].cdfk = 0;
    }
    int prev = kth_neighbor(w, K - 1);
    if (prev >= 0)
        stack[find(prev)].cdfk++;
    while (LCP[w] < stack[sp].SIL) {
        putw(stack[sp].cdfk, out); /* report */
        if (LCP[w] <= stack[sp - 1].SIL) {
            stack[sp - 1].cdfk += stack[sp].cdfk;
            sp--;
        } else
            stack[sp].SIL = LCP[w];
    }
}
App: Word Breaking & Term Extraction
Challenge: establish value beyond standard corpus freq and trigrams
• No spaces in Japanese and Chinese
  – English: Kim Dae Jung before Presidency
  – Japanese: 大統領になる以前の金大中
  – Chinese: 未上任前的金大中
• English has spaces, but…
  – Phrases: white house
  – NER (named entity recognition)
• Word Breaking
  – Dictionary-based (ChaSen)
    • Dynamic programming
    • Fewest edges (dictionary entries) that cover the input
• Challenges for the dictionary
  – Out-of-Vocabulary (OOV)
  – Technical terminology
  – Proper nouns
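The “fewest edges” objective above is a small dynamic program. This is a sketch, not ChaSen itself: the four-entry dictionary is a toy, and real systems score edges rather than just counting them.

```c
#include <string.h>

#define INF 1000000

/* Fewest dictionary entries covering s[0..n), or INF if some part of
   the input is out of vocabulary.  best[i] = fewest words covering
   the first i characters; assumes strlen(s) < 64. */
int min_edges(const char *s, const char **dict, int ndict) {
    int n = (int)strlen(s);
    int best[64];
    best[0] = 0;
    for (int i = 1; i <= n; i++) {
        best[i] = INF;
        for (int d = 0; d < ndict; d++) {
            int len = (int)strlen(dict[d]);
            if (len <= i && best[i - len] < INF &&
                strncmp(s + i - len, dict[d], len) == 0 &&
                best[i - len] + 1 < best[i])
                best[i] = best[i - len] + 1;
        }
    }
    return best[n];
}
```

With dict = {white, house, whitehouse, the}, "thewhitehouse" segments as the + whitehouse (2 edges, beating the 3-edge the + white + house); an OOV fragment like "xhouse" gets no covering path, which is exactly where the adaptation signal on the next slide is meant to help.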
Using Adaptation to Distinguish
Terms from Random Fragments
• Adaptation: Pr(k≥2 | k≥1) ≈ df2 / df1
• Null hypothesis
  – Poisson: df2 / df1 ≈ df1 / D
  – Not bad for random fragments
• OOV (and substrings thereof)
  – Adapt too much for the null hypothesis (Poisson)
  – If an OOV word is mentioned once in a document, it will probably be mentioned again
  – Not true for random fragments
Using Adaptation to Reject the Null Hypothesis
[Chart: adaptation along the character string, with word boundaries marked]

  Japanese   English gloss
  フジモリ    Fujimori
  大統領      President
  が          <function word>
English Example
[Chart: adaptation v. doc freq v. baseline]
Adaptation: Conclusions
1. Large magnitude (p/2 >> p²); big quantity discounts
2. Distinctive shape
   • 1st mention depends on freq
     – 2nd does not
   • Priming: between 1st mention and 2nd
3. Lexical:
   – Independence assumptions aren’t bad for meaningless random strings, function words, common first names, etc.
   – More adaptation for content words (good keywords, OOV)
Outline
1. We’re making consistent progress, or
2. We’re running around in circles, or
   • Don’t worry, be happy
   [We are here]
3. We’re going off a cliff…
The rising tide of data will lift all boats!
TREC Question Answering & Google:
What is the highest point on Earth?
The rising tide of data will lift all boats!
Acquiring Lexical Resources from Data:
Dictionaries, Ontologies, WordNets, Language Models, etc.
http://labs1.google.com/sets

  England    Japan       Cat        cat
  France     China       Dog        more
  Germany    India       Horse      ls
  Italy      Indonesia   Fish       rm
  Ireland    Malaysia    Bird       mv
  Spain      Korea       Rabbit     cd
  Scotland   Taiwan      Cattle     cp
  Belgium    Thailand    Rat        mkdir
  Canada     Singapore   Livestock  man
  Austria    Australia   Mouse      tail
  Australia  Bangladesh  Human      pwd
Rising Tide of Data Lifts All Boats
• More data → better results
  – TREC Question Answering
    • Remarkable performance: Google and not much else
      – Norvig (ACL-02)
      – AskMSR (SIGIR-02)
  – Lexical Acquisition
    • Google Sets
      – Hanks and I tried similar things
        » but with tiny corpora
        » which we called large
Outline
• We’re making consistent progress, or
• We’re running around in circles, or
  – Don’t worry; be happy
• We’re going off a cliff…
According to unnamed sources:
  Speech Winter → Language Winter
  Dot Boom → Dot Bust
What is the answer to all
questions?
• 6 years
% Statistical Papers
When will we see the last non-statistical paper? 2010?
[Chart: percentage of statistical papers per ACL meeting, 1985–2005, climbing from near 0% toward 100%; labels: Bob Moore, Fred Jelinek]
Covering all the Bases
It is hard to make predictions (especially about the future)
• When will we see the last non-statistical paper? 2010?
• Revival of rationalism: 2010?
  – 1950s: Empiricism
    • Information Theory, Behaviorism
  – 1970s: Rationalism
    • AI, Cognitive Psychology
  – 1990s: Empiricism
    • Data Mining, Statistical NLP, Speech
  – 2010s: Rationalism
    • TBD
Sample of 20 Survey Questions
(Strong Emphasis on Applications)
• When will
– More than 50% of new PCs have dictation on them, either at
purchase or shortly after.
– Most telephone Interactive Voice Response (IVR) systems
accept speech input.
– Automatic airline reservation by voice over the telephone is the
norm.
– TV closed-captioning (subtitling) is automatic and pervasive.
– Telephones are answered by an intelligent answering machine
that converses with the calling party to determine the nature and
priority of the call.
– Public proceedings (e.g., courts, public inquiries, parliament,
etc.) are transcribed automatically.
• Two surveys of ASRU attendees: 1997 & 2003
2003 Responses ≈ 1997 Responses + 6 Years
(6 years of hard work → no progress)
Wrong Apps?
• Old Priorities
  – Dictation app dates back to days of dictation machines
  – Speech recognition has not displaced typing
    • Speech recognition has improved
    • But typing skills have improved even more
    • My son will learn typing in 1st grade
  – Secretaries rarely take dictation
  – Dictation machines are history
    • My son may never see one
    • Museums have slide rules and steam trains
      – But dictation machines?
• New Priorities
  – Increase demand for space >> Data entry
• New Killer Apps
  – Search >> Dictation
    • Speech Google!
  – Data mining
Speech Data Mining
& Call Centers:
An Intelligence Bonanza
• Some companies are collecting
information with technology
designed to monitor incoming calls
for service quality.
• Last summer, Continental Airlines
Inc. installed software from
Witness Systems Inc. to monitor
the 5,200 agents in its four
reservation centers.
• But the Houston airline quickly
realized that the system, which
records customer phone calls and
information on the responding
agent's computer screen, also was
an intelligence bonanza, says
André Harris, reservations training
and quality-assurance director.
Speech Data Mining
• Label calls as success or failure based on
some subsequent outcome (sale/no sale)
• Extract features from speech
• Find patterns of features that can be used
to predict outcomes
• Hypotheses:
  – Customer: “I’m not interested” → no sale
  – Agent: “I just want to tell you…” → no sale
Inter-ocular effect (hits you between the eyes);
don’t need a statistician to know which way the wind is blowing
Borrowed Slide: Jelinek (LREC)
Great Strategy → Success
Great Challenge: Annotating Data
• Produce annotated data with minimal supervision [Self-organizing “Magic”?]
• Active learning
  – Identify reliable labels
  – Identify best candidates for annotation
• Co-training
• Bootstrap (project) resources from one application to another
Grand Challenges
ftp://ftp.cordis.lu/pub/ist/docs/istag040319-draftnotesofthemeeting.pdf
Roadmaps: Structure of a Strategy
(not the union of what we are all doing)
• Goals
  – Example: Replace keyboard with microphone
  – Exciting (memorable) sound bite
  – Broad grand challenge that we can work toward but never solve
• Metrics
  – Examples:
    • WER: word error rate
    • Time to perform task
  – Easy to measure
• Milestones
  – Mostly for next year: Q1-4
  – Plus some for years 2, 5, 10 & 20
  – Should be no question if it has been accomplished
  – Example: reduce WER on task x by y% by time t
• Accomplishments v. Activities
  – Accomplishments are good
  – Activity is not a substitute for accomplishments
  – Milestones look forward whereas accomplishments look backward
  – Serendipity is good!
• Broad applicability & illustrative
  – Don’t cover everything
  – Highlight stuff that
    • Applies to multiple groups
    • Is forward-looking / exciting
• Awareness
  – 1-slide version
    • If successful, you get maybe 3 more slides
• Size of container
  – Goals: 1-3
  – Metrics: 3
  – Milestones: a dozen
  – Accomplishments: a dozen
  – Quantity is not a good thing
  – Small is beautiful
Grand Challenges
Infrastructure
Grand Challenges
Goals:
1. The multilingual companion
2. Life log
• Goal: Produce NLP apps that improve the way people communicate with one another
• Goal: Reduce barriers to entry (€)
[Diagram: Apps & Techniques, Resources, Evaluation]
Summary: What Worked and What Didn’t? What’s the right answer?
[There’ll be a quiz at the end of the decade…]
• Data (Substance: recommended if…)
  – Stay on msg: It is the data, stupid!
  – If you have a lot of data, then you don’t need a lot of methodology
  – Rising Tide of Data Lifts All Boats
• Methodology (Magic: recommended if…)
  – Empiricism means different things to different people:
    1. Machine Learning (Self-organizing Methods)
    2. Exploratory Data Analysis (EDA)
    3. Corpus-Based Lexicography
  – Lots of papers on 1
  – EMNLP-2004 theme (error analysis) → 2
  – Senseval grew out of 3
• Short term ≠ Long term (Promise: recommended if… lonely)