Original - Kansas State University
Lecture 39 of 42
Natural Language Processing (NLP)
Discussion: Machine Translation (MT)
Wednesday, 29 November 2006
William H. Hsu
Department of Computing and Information Sciences, KSU
KSOL course page: http://snipurl.com/v9v3
Course web site: http://www.kddresearch.org/Courses/Fall-2006/CIS730
Instructor home page: http://www.cis.ksu.edu/~bhsu
Reading for Next Class:
Sections 22.1, 22.6-7, Russell & Norvig 2nd edition
CIS 490 / 730: Artificial Intelligence
Wednesday, 29 Nov 2006
Computing & Information Sciences
Kansas State University
Lecture Outline
Reference: Sections 6.9-6.10, Mitchell
Simple Bayes, aka Naïve Bayes
More examples
Classification: choosing between two classes; general case
Robust estimation of probabilities
Learning in Natural Language Processing (NLP)
Learning over text: problem definitions
Case study: Newsweeder (Naïve Bayes application)
Probabilistic framework
Bayesian approaches to NLP
• Issues: word sense disambiguation, part-of-speech tagging
• Applications: spelling correction, web and document searching
Related Material, Mitchell; Pearl
Read: “Bayesian Networks without Tears”, Charniak
Go over Chapter 14, Russell and Norvig; Heckerman tutorial (slides)
Learning Framework for Natural Language:
(Hidden) Markov Models
Definition of Hidden Markov Models (HMMs)
Stochastic state transition diagram (HMMs: states, aka nodes, are hidden)
Compare: probabilistic finite state automaton (Mealy/Moore model)
Annotated transitions (aka arcs, edges, links)
• Output alphabet (the observable part)
• Probability distribution over outputs
[Figure: state transition diagram with three states (1, 2, 3); arcs carry transition probabilities and output distributions such as (A 0.4, B 0.6), (E 0.1, F 0.9), (A 0.5, G 0.3, H 0.2), (C 0.8, D 0.2), (E 0.3, F 0.7), (A 0.1, G 0.9)]
Forward Problem: One Step in ML Estimation
Given: model h, observations (data) D
Estimate: P(D | h)
Backward Problem: Prediction Step
Given: model h, observations D
Maximize: P(h(X) = x | h, D) for a new X
Forward-Backward (Learning) Problem
Given: model space H, data D
Find: h ∈ H such that P(h | D) is maximized (i.e., MAP hypothesis)
HMMs Also A Case of LSQ (f Values in [Roth, 1999])
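The forward problem above, estimating P(D | h), can be sketched with the standard forward recursion. This is a minimal sketch under stated assumptions: it uses a state-emitting HMM (outputs attached to states, not to arcs as in the slide's diagram), and the two states, their transition probabilities, and the name `forward_likelihood` are hypothetical. Only the output distributions (A 0.4 / B 0.6 and E 0.1 / F 0.9) are borrowed from the figure.

```python
def forward_likelihood(init, trans, emit, observations):
    """Return P(observations | model) by summing over all hidden state paths."""
    if not observations:
        return 1.0
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: init[s] * emit[s].get(observations[0], 0.0) for s in init}
    for symbol in observations[1:]:
        alpha = {
            s: emit[s].get(symbol, 0.0)
            * sum(alpha[r] * trans[r][s] for r in alpha)
            for s in init
        }
    return sum(alpha.values())


# Hypothetical two-state model (assumption, not the slide's exact diagram).
init = {"s1": 1.0, "s2": 0.0}
trans = {
    "s1": {"s1": 0.5, "s2": 0.5},
    "s2": {"s1": 0.2, "s2": 0.8},
}
emit = {
    "s1": {"A": 0.4, "B": 0.6},
    "s2": {"E": 0.1, "F": 0.9},
}

likelihood = forward_likelihood(init, trans, emit, ["A", "F"])
print(likelihood)
```

The recursion costs time proportional to (number of states)² per observation, instead of enumerating every hidden path explicitly.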
NLP Issues:
Word Sense Disambiguation (WSD)
Problem Definition
Given: m sentences, each containing a usage of a particular ambiguous word
Example: “The can will rust.” (auxiliary verb versus noun)
Label: vj = s, the correct word sense (e.g., s ∈ {auxiliary verb, noun})
Representation: m examples (labeled attribute vectors <(w1, w2, …, wn), s>)
Return: classifier f: X → V that disambiguates new x = (w1, w2, …, wn)
Solution Approach: Use Bayesian Learning (e.g., Naïve Bayes)
$P(w_1, w_2, \ldots, w_n \mid s) = \prod_{i=1}^{n} P(w_i \mid s)$
Caveat: can’t observe s in the text!
A solution: treat s in P(w | s) as missing value, impute s (assign by inference)
[Pedersen and Bruce, 1998]: fill in using Gibbs sampling, EM algorithm (later)
[Roth, 1998]: Naïve Bayes, sparse networks of Winnows (SNOW), TBL
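The Naïve Bayes factorization above can be sketched as a small classifier. This is a minimal sketch, assuming a handful of sense-labeled context windows are available (the slide's caveat is exactly that s is usually not observed, so in practice the labels would be imputed). The training examples, the add-one smoothing, and the names `train_naive_bayes` and `classify` are illustrative assumptions, not the method of the cited papers.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count senses and context words from (context_words, sense) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab


def classify(model, words):
    """Pick the sense s maximizing log P(s) + sum_i log P(w_i | s)."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())

    def log_score(sense):
        # Add-one smoothing keeps unseen words from zeroing the product.
        denom = sum(word_counts[sense].values()) + len(vocab)
        score = math.log(sense_counts[sense] / total)
        for w in words:
            score += math.log((word_counts[sense][w] + 1) / denom)
        return score

    return max(sense_counts, key=log_score)


# Hypothetical labeled contexts for the ambiguous word "can".
examples = [
    (["the", "will", "rust"], "noun"),
    (["the", "is", "empty"], "noun"),
    (["you", "go", "now"], "aux"),
    (["we", "win", "today"], "aux"),
]
model = train_naive_bayes(examples)
print(classify(model, ["the", "rust"]))
```

Working in log space avoids numerical underflow when the context window is long.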
Recent Research
T. Pedersen’s research home page: http://www.d.umn.edu/~tpederse/
D. Roth’s Cognitive Computation Group: http://l2r.cs.uiuc.edu/~cogcomp/
NLP Issues:
Part-of-Speech (POS) Tagging
Problem Definition
Given: m sentences containing untagged words
Example: “The can will rust.”
Label (one per word, out of ~30-150): vj = s ∈ (art, n, aux, vi)
Representation: labeled examples <(w1, w2, …, wn), s>
Return: classifier f: X → V that tags x = (w1, w2, …, wn)
Applications: WSD, dialogue acts (e.g., “That sounds OK to me.” -> ACCEPT)
Solution Approaches: Use Transformation-Based Learning (TBL)
[Brill, 1995]: TBL - mistake-driven algorithm that produces sequences of rules
• Each rule of the form (ti, v): a test condition (constructed attribute) and a tag
• ti: “w occurs within k words of wi” (context words); collocations (windows)
For more info: see [Roth, 1998], [Samuel, Carberry, Vijay-Shanker, 1998]
[Figure: stack of processing levels: Natural Language, Lexical Analysis, Parsing / POS Tagging, Discourse Labeling, Speech Acts]
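Applying an ordered sequence of TBL rules can be sketched as follows. This is a minimal sketch of rule application only, not of the mistake-driven learning loop. The initial tags, the two rules, and the name `apply_rules` are hypothetical; each rule encodes the test "context word occurs within k words of the target word" together with the tag to assign.

```python
def apply_rules(words, initial_tags, rules):
    """Apply (target, context, k, new_tag) rules in order to an initial tagging."""
    tags = list(initial_tags)
    for target, context, k, new_tag in rules:
        for i, word in enumerate(words):
            # Words within k positions to the left or right of position i.
            window = words[max(0, i - k):i] + words[i + 1:i + 1 + k]
            if word == target and context in window:
                tags[i] = new_tag
    return tags


words = ["the", "can", "will", "rust"]
# Hypothetical initial guess, e.g. each word's most frequent tag.
initial = ["art", "aux", "aux", "n"]
# Hypothetical learned rules, applied in sequence.
rules = [
    ("can", "the", 1, "n"),     # "can" next to "the" is a noun
    ("rust", "will", 1, "vi"),  # "rust" next to "will" is a verb
]
print(apply_rules(words, initial, rules))
```

Because later rules see the output of earlier ones, the order of the rule sequence matters.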
Recent Research
E. Brill’s page: http://www.cs.jhu.edu/~brill/
K. Samuel’s page: http://www.eecis.udel.edu/~samuel/work/research.html
NLP Applications:
Info Retrieval (IR) and Digital Libraries
Information Retrieval (IR)
One role of learning: produce classifiers for documents (see [Sahami, 1999])
Query-based search engines (e.g., for WWW: AltaVista, Lycos, Yahoo)
Applications: bibliographic searches (citations, patent intelligence, etc.)
Bayesian Classification: Integrating Supervised and Unsupervised Learning
Unsupervised learning: organize collections of documents at a “topical” level
e.g., AutoClass [Cheeseman et al., 1988]; self-organizing maps [Kohonen, 1995]
More on this topic (document clustering) soon
Framework Extends Beyond Natural Language
Collections of images, audio, video, other media
Five Ss: Source, Stream, Structure, Scenario, Society
Book on IR [van Rijsbergen, 1979]: http://www.dcs.gla.ac.uk/Keith/Preface.html
Recent Research
M. Sahami’s page (Bayesian IR): http://robotics.stanford.edu/users/sahami
Digital libraries (DL) resources: http://fox.cs.vt.edu
Statistical Machine Translation
Kevin Knight
USC/Information Sciences Institute
USC/Computer Science Department
Machine Translation
美国关岛国际机场及其办公室均接获一
名自称沙地阿拉伯富商拉登等发出的电
子邮件,威胁将会向机场等公众地方发
动生化袭击後,关岛经保持高度戒备。
The U.S. island of Guam is maintaining a high
state of alert after the Guam airport and its offices
both received an e-mail from someone calling
himself the Saudi Arabian Osama bin Laden and
threatening a biological/chemical attack against
public places such as the airport .
The classic acid test for natural language processing.
Requires capabilities in both interpretation and generation.
About $10 billion spent annually on human translation.
MT Strategies (1954-2004)
[Figure: MT strategies plotted on two axes. Slide courtesy of Laurie Gerber.]
Knowledge Acquisition Strategy, from all manual to fully automated: hand-built by experts; hand-built by non-experts; learn from annotated data; learn from unannotated data
Knowledge Representation Strategy, from shallow/simple to deep/complex: word-based only; syntactic constituent structure; semantic analysis; interlingua
Labels placed on the chart: electronic dictionaries; phrase tables; original direct approach; typical transfer system; classic interlingual system; original statistical MT; example-based MT
New Research Goes Here!
Data-Driven Machine Translation
Man, this is so boring.
Hmm, every time he sees “banco”, he either types “bank” or “bench” … but if he sees “banco de…”, he always types “bank”, never “bench”…
Translated documents
Recent Progress in Statistical MT
slide from C. Wayne, DARPA
2002:
insistent Wednesday may recurred her trips to Libya tomorrow for flying
Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment .
And said the official " the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air , a situation her receiving replying are so a trip will pull to Libya a morning Wednesday " .
2003:
Egyptair Has Tomorrow to Resume Its Flights to Libya
Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.
" The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning ".
Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan:
farok crrrok hihok yorok clok kantok ok-yurp
1a. ok-voon ororok sprok .
1b. at-voon bichat dat .
2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .
3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .
4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .
5a. wiwok farok izok stok .
5b. totat jjat quat cat .
6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .
7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .
8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .
9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .
10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .
11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .
12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .
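One way to attack the assignment is to intersect, for each Centauri word, the Arcturan sentences it co-occurs with. The sketch below is an illustration of that co-occurrence reasoning over the twelve sentence pairs, not the statistical alignment model itself; the name `translation_candidates` is hypothetical.

```python
# The twelve Centauri/Arcturan sentence pairs from the slide.
PAIRS = [
    ("ok-voon ororok sprok .", "at-voon bichat dat ."),
    ("ok-drubel ok-voon anok plok sprok .", "at-drubel at-voon pippat rrat dat ."),
    ("erok sprok izok hihok ghirok .", "totat dat arrat vat hilat ."),
    ("ok-voon anok drok brok jok .", "at-voon krat pippat sat lat ."),
    ("wiwok farok izok stok .", "totat jjat quat cat ."),
    ("lalok sprok izok jok stok .", "wat dat krat quat cat ."),
    ("lalok farok ororok lalok sprok izok enemok .", "wat jjat bichat wat dat vat eneat ."),
    ("lalok brok anok plok nok .", "iat lat pippat rrat nnat ."),
    ("wiwok nok izok kantok ok-yurp .", "totat nnat quat oloat at-yurp ."),
    ("lalok mok nok yorok ghirok clok .", "wat nnat gat mat bat hilat ."),
    ("lalok nok crrrok hihok yorok zanzanok .", "wat nnat arrat mat zanzanat ."),
    ("lalok rarok nok izok hihok mok .", "wat nnat forat arrat vat gat ."),
]


def translation_candidates(word, pairs):
    """Arcturan words present in every pair whose Centauri side contains word."""
    candidates = None
    for centauri, arcturan in pairs:
        if word in centauri.split():
            seen = set(arcturan.split()) - {"."}
            candidates = seen if candidates is None else candidates & seen
    return candidates if candidates is not None else set()


for w in ["farok", "hihok", "crrrok"]:
    print(w, sorted(translation_candidates(w, PAIRS)))
```

Words seen in several pairs, such as farok, narrow to a single candidate, while a word seen only once, such as crrrok, stays ambiguous and needs further reasoning such as elimination of words already explained.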
process of elimination
cognate?
Centauri/Arcturan [Knight, 1997]
Your assignment, put these words in order:
{ jjat, arrat, mat, bat, oloat, at-yurp }
zero fertility
It’s Really Spanish/English
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa
1a. Garcia and associates .
1b. Garcia y asociados .
2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .
3a. his associates are not strong .
3b. sus asociados no son fuertes .
4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .
5a. its clients are angry .
5b. sus clientes estan enfadados .
6a. the associates are also angry .
6b. los asociados tambien estan enfadados .
7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .
8a. the company has three groups .
8b. la empresa tiene tres grupos .
9a. its groups are in Europe .
9b. sus grupos estan en Europa .
10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .
11a. the groups do not sell zanzanine .
11b. los grupos no venden zanzanina .
12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .
Data for Statistical MT
and data preparation
Ready-to-Use Online Bilingual Data
[Chart: millions of words (English side) of ready-to-use parallel text by year, 1994-2004, with series for Chinese/English, Arabic/English, and French/English]
(Data stripped of formatting, in sentence-pair format, available
from the Linguistic Data Consortium at UPenn).
+ 1m-20m words for many language pairs
One Billion?
From No Data to Sentence Pairs
Easy way: Linguistic Data Consortium (LDC)
Really hard way: pay $$$
Suppose one billion words of parallel data were sufficient
At 20 cents/word, that’s $200 million
Pretty hard way: Find it, and then earn it!
De-formatting
Remove strange characters
Character code conversion
Document alignment
Sentence alignment
Tokenization (also called Segmentation)
Sentence Alignment
English: The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.
Spanish: El viejo está feliz porque ha pescado muchos veces. Su mujer habla con él. Los tiburones esperan.
Sentence Alignment
English:
1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.
Spanish:
1. El viejo está feliz porque ha pescado muchos veces.
2. Su mujer habla con él.
3. Los tiburones esperan.
Sentence Alignment
1. The old man is happy. He has fished many times. => El viejo está feliz porque ha pescado muchos veces.
2. His wife talks to him. => Su mujer habla con él.
3. The sharks await. => Los tiburones esperan.
Note that unaligned sentences are thrown out, and
sentences are merged in n-to-m alignments (n, m > 0).
Tokenization (or Segmentation)
English
Input (some byte stream):
"There," said Bob.
Output (7 “tokens” or “words”):
" There , " said Bob .
Chinese
Input (byte stream):
美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件。
Output:
美国 关岛国 际机 场 及其 办公 室均接获 一名 自称 沙地 阿拉 伯 富 商拉登 等发 出 的 电子邮件。
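The English case on this slide can be sketched with a regular expression that separates words from punctuation. This is a minimal sketch, assuming whitespace-delimited text; the name `tokenize` is hypothetical, and the approach does not carry over to Chinese, where there are no spaces and a dictionary-based or statistical segmenter is needed.

```python
import re


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize('"There," said Bob.')
print(len(tokens), tokens)
```

A single pattern like this splits every punctuation mark off, so a production tokenizer would add special cases for items such as abbreviations, numbers, and hyphenated words.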
MT Evaluation
MT Evaluation
Manual:
SSER (subjective sentence error rate)
Correct/Incorrect
Error categorization
Testing in an application that uses MT as one sub-component
Question answering from foreign language documents
Automatic:
WER (word error rate)
BLEU (Bilingual Evaluation Understudy)
BLEU Evaluation Metric
(Papineni et al, ACL-2002)
Reference (human) translation:
The U.S. island of Guam is
maintaining a high state of alert
after the Guam airport and its
offices both received an e-mail
from someone calling himself the
Saudi Arabian Osama bin Laden
and threatening a
biological/chemical attack against
public places such as the airport .
Machine translation:
The American [?] international
airport and its the office all
receives one calls self the sand
Arab rich business [?] and so on
electronic mail , which sends out ;
The threat will be able after public
place and so on the airport to start
the biochemistry attack , [?] highly
alerts after the maintenance.
• N-gram precision (score is between 0 & 1)
– What percentage of machine n-grams can be found in the reference translation?
– An n-gram is a sequence of n words
– Not allowed to use same portion of reference translation twice (can’t cheat by typing out “the the the the the”)
• Brevity penalty
– Can’t just type out single word “the” (precision 1.0!)
*** Amazingly hard to “game” the system (i.e., find a way to change machine output so that BLEU goes up, but quality doesn’t)
• BLEU4 formula (counts n-grams up to length 4)
exp(1.0 * log p1 + 0.5 * log p2 + 0.25 * log p3 + 0.125 * log p4 - max(words-in-reference / words-in-machine - 1, 0))
p1 = 1-gram precision
p2 = 2-gram precision
p3 = 3-gram precision
p4 = 4-gram precision
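The formula on this slide can be sketched for a single reference translation as follows. This is a minimal sketch that follows the slide's weights (1.0, 0.5, 0.25, 0.125); the commonly used BLEU definition from Papineni et al. weights the four log precisions uniformly instead. The names `ngrams`, `clipped_precision`, and `slide_bleu4` are hypothetical, and returning 0 when any precision is 0 is a simplifying assumption made to avoid log(0).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-word sequences in tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(machine, reference, n):
    """Fraction of machine n-grams found in the reference, each reference
    n-gram usable at most as often as it occurs there."""
    machine_counts = Counter(ngrams(machine, n))
    reference_counts = Counter(ngrams(reference, n))
    total = sum(machine_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, reference_counts[g]) for g, c in machine_counts.items())
    return matched / total


def slide_bleu4(machine, reference, weights=(1.0, 0.5, 0.25, 0.125)):
    """Score per the slide's formula, including the brevity penalty term."""
    precisions = [clipped_precision(machine, reference, n) for n in range(1, 5)]
    if min(precisions) == 0.0:
        return 0.0
    log_sum = sum(w * math.log(p) for w, p in zip(weights, precisions))
    brevity = max(len(reference) / len(machine) - 1.0, 0.0)
    return math.exp(log_sum - brevity)


reference = "the cat sat on the mat".split()
print(clipped_precision(["the"] * 5, reference, 1))
print(slide_bleu4(reference, reference))
```

The clipping is what defeats the "the the the the the" trick: only two of the five machine unigrams can be matched, because the reference contains "the" only twice.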
Multiple Reference Translations
Reference translation 1:
The U.S. island of Guam is maintaining
a high state of alert after the Guam
airport and its offices both received an
e-mail from someone calling himself
the Saudi Arabian Osama bin Laden
and threatening a biological/chemical
attack against public places such as
the airport .
Reference translation 2:
Guam International Airport and its
offices are maintaining a high state of
alert after receiving an e-mail that was
from a person claiming to be the
wealthy Saudi Arabian businessman
Bin Laden and that threatened to
launch a biological and chemical attack
on the airport and other public places .
Machine translation:
The American [?] international airport
and its the office all receives one calls
self the sand Arab rich business [?]
and so on electronic mail , which
sends out ; The threat will be able
after public place and so on the
airport to start the biochemistry attack
, [?] highly alerts after the
maintenance.
Reference translation 3:
The US International Airport of Guam
and its office has received an email
from a self-claimed Arabian millionaire
named Laden , which threatens to
launch a biochemical attack on such
public places as airport . Guam
authority has been on alert .
Reference translation 4:
US Guam International Airport and its
office received an email from Mr. Bin
Laden and other rich businessman
from Saudi Arabia . They said there
would be biochemistry air raid to Guam
Airport and other public places . Guam
needs to be in high precaution about
this matter .
BLEU Tends to Predict Human Judgments
[Scatter plot: NIST score (variant of BLEU) against human judgments, both axes from -2.5 to 2.5; linear fits: Adequacy R² = 88.0%, Fluency R² = 90.2%]
slide from G. Doddington (NIST)
Word-Based Statistical MT
Statistical MT Systems
Spanish/English Bilingual Text -> Statistical Analysis
English Text -> Statistical Analysis
Spanish: Que hambre tengo yo
Broken English: What hunger have I, Hungry I am so, I am so hungry, Have I that hunger …
English: I am so hungry
Computing & Information Sciences
Kansas State University
Statistical MT Systems
Spanish/English Bilingual Text -> Statistical Analysis -> Translation Model P(s|e)
English Text -> Statistical Analysis -> Language Model P(e)
Spanish -> Broken English -> English
Decoding algorithm: argmax_e P(e) * P(s|e)
Que hambre tengo yo -> I am so hungry
Three Problems for Statistical MT
Language model
Given an English string e, assigns P(e) by formula
good English string -> high P(e)
random word sequence -> low P(e)
Translation model
Given a pair of strings <f,e>, assigns P(f | e) by formula
<f,e> look like translations -> high P(f | e)
<f,e> don’t look like translations -> low P(f | e)
Decoding algorithm
Given a language model, a translation model, and a new sentence f …
find translation e maximizing P(e) * P(f | e)
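The decoding objective can be illustrated with a minimal sketch that scores a handful of candidate strings and picks the best one. The candidate list and both probability tables below are hypothetical toy values, not numbers from the lecture; a real decoder searches a far larger space and does not enumerate candidates.

```python
import math

# Hypothetical toy tables; none of these numbers come from the lecture.
# LM holds P(e); TM holds P(f | e) for the fixed input f = "Que hambre tengo yo".
LM = {
    "I am so hungry": 1e-4,
    "Hungry I am so": 1e-7,
    "What hunger have I": 1e-8,
}
TM = {
    "I am so hungry": 1e-3,
    "Hungry I am so": 2e-3,
    "What hunger have I": 5e-3,
}

def decode(candidates):
    """Return the candidate e maximizing log P(e) + log P(f | e)."""
    return max(candidates, key=lambda e: math.log(LM[e]) + math.log(TM[e]))

best = decode(list(LM))
print(best)
```

Working in log space avoids underflow when many small probabilities are multiplied; the argmax is unchanged because log is monotonic.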
The Classic Language Model
Word N-Grams
Goal of the language model -- choose among:
He is on the soccer field
He is in the soccer field
Is table the on cup the
The cup is on the table
Rice shrine
American shrine
Rice company
American company
The Classic Language Model
Word N-Grams
Generative approach:
w1 = START
repeat until END is generated:
produce word w2 according to a big table P(w2 | w1)
w1 := w2
P(I saw water on the table) =
P(I | START) *
P(saw | I) *
P(water | saw) *
P(on | water) *
P(the | on) *
P(table | the) *
P(END | table)
Probabilities can be learned
from online English text.
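The bigram factorization above can be sketched in a few lines. This is a minimal illustration that estimates P(w2 | w1) by relative frequency on a three-sentence toy corpus (the corpus is an assumption made for the example) and uses no smoothing, so any unseen bigram drives the sentence probability to zero.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate P(w2 | w1) by relative frequency over whitespace-tokenized sentences."""
    pair_counts, ctx_counts = Counter(), Counter()
    for s in sentences:
        toks = ["START"] + s.split() + ["END"]
        for w1, w2 in zip(toks, toks[1:]):
            pair_counts[(w1, w2)] += 1
            ctx_counts[w1] += 1
    return {pair: c / ctx_counts[pair[0]] for pair, c in pair_counts.items()}

def sentence_prob(table, sentence):
    """Multiply P(w2 | w1) along the sentence, from START through END."""
    toks = ["START"] + sentence.split() + ["END"]
    p = 1.0
    for w1, w2 in zip(toks, toks[1:]):
        p *= table.get((w1, w2), 0.0)  # unseen bigram -> 0 (no smoothing)
    return p

# Toy corpus, assumed for illustration only.
corpus = ["I saw water on the table", "I saw the cup", "the cup is on the table"]
table = train_bigram(corpus)
p_good = sentence_prob(table, "I saw water on the table")
p_bad = sentence_prob(table, "table the on water saw I")
print(p_good, p_bad)
```

On this corpus the fluent word order receives a nonzero probability while the scrambled order receives zero, which is the ranking behavior the language model is meant to provide.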
Translation Model?
Generative approach:
Mary did not slap the green witch
Source-language morphological analysis
Source parse tree
Semantic representation
Generate target structure
Maria no dió una bofetada a la bruja verde
Translation Model?
Generative story:
Mary did not slap the green witch
Source-language morphological analysis
Source parse tree
Semantic representation
Generate target structure
Maria no dió una bofetada a la bruja verde
What are all the possible moves and their associated probability tables?
The Classic Translation Model
Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]
Generative approach:
Mary did not slap the green witch
n(3|slap): fertility
Mary not slap slap slap the green witch
P-Null: NULL insertion
Mary not slap slap slap NULL the green witch
t(la|the): word translation
Maria no dió una bofetada a la verde bruja
d(j|i): distortion (re-ordering)
Maria no dió una bofetada a la bruja verde
Probabilities can be learned from raw bilingual text.
Statistical Machine Translation
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
All word alignments equally likely
All P(french-word | english-word) equally likely
Statistical Machine Translation
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
“la” and “the” observed to co-occur frequently,
so P(la | the) is increased.
Statistical Machine Translation
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
“house” co-occurs with both “la” and “maison”, but
P(maison | house) can be raised without limit, to 1.0,
while P(la | house) is limited because of “the”
(pigeonhole principle)
Statistical Machine Translation
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
settling down after another iteration
Statistical Machine Translation
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
Inherent hidden structure revealed by EM training!
For details, see:
• “A Statistical MT Tutorial Workbook” (Knight, 1999).
• “The Mathematics of Statistical Machine Translation” (Brown et al, 1993)
• Software: GIZA++
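The EM walk-through on the la maison / the house corpus can be reproduced with a small sketch. The code below uses IBM Model 1 (word translation probabilities only, no NULL word, no fertility or distortion), which is a simplification of the Model 3 story shown earlier; the function name and iteration count are choices made for this example.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """EM for t(f | e) on (foreign_words, english_words) sentence pairs; no NULL word."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Start with all P(french-word | english-word) equally likely.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count(f, e)
        total = defaultdict(float)  # expected count(e)
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm  # E-step: posterior that f aligns to e
                    count[(f, e)] += frac
                    total[e] += frac
        for (f, e), c in count.items():      # M-step: renormalize per English word
            t[(f, e)] = c / total[e]
    return t

pairs = [
    ("la maison".split(), "the house".split()),
    ("la maison bleue".split(), "the blue house".split()),
    ("la fleur".split(), "the flower".split()),
]
t = ibm_model1(pairs)
for pair in [("la", "the"), ("maison", "house"), ("fleur", "flower")]:
    print(pair, round(t[pair], 3))
```

After a few iterations P(la | the) and P(maison | house) dominate their competitors, mirroring the "settling down" described on the slides.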
Statistical Machine Translation
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
P(juste | fair) = 0.411
P(juste | correct) = 0.027
P(juste | right) = 0.020
…
New French sentence -> possible English translations, to be rescored by language model
Decoding for “Classic” Models
Of all conceivable English word strings, find the one maximizing P(e) x P(f | e)
Decoding is an NP-complete challenge (Knight, 1999)
Several search strategies are available
Each potential English output is called a hypothesis.
The Classic Results
la politique de la haine . (Foreign Original)
politics of hate . (Reference Translation)
the policy of the hatred . (IBM4+N-grams+Stack)
nous avons signé le protocole . (Foreign Original)
we did sign the memorandum of agreement . (Reference Translation)
we have signed the protocol . (IBM4+N-grams+Stack)
où était le plan solide ? (Foreign Original)
but where was the solid plan ? (Reference Translation)
where was the economic base ? (IBM4+N-grams+Stack)
the Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40.007 billion US dollars today provide data include that year to November china actually using foreign 46.959 billion US dollars and
Flaws of Word-Based MT
Multiple English words for one French word
IBM models can do one-to-many (fertility) but not many-to-one
Phrasal Translation
“real estate”, “note that”, “interest in”
Syntactic Transformations
Verb at the beginning in Arabic
Translation model penalizes any proposed re-ordering
Language model not strong enough to force the verb to move to the right place
Phrase-Based Statistical MT
Phrase-Based Statistical MT
Morgen | fliege | ich | nach Kanada | zur Konferenz
Tomorrow | I | will fly | to the conference | In Canada
Foreign input segmented into phrases
“phrase” is any sequence of words
Each phrase is probabilistically translated into English
P(to the conference | zur Konferenz)
P(into the meeting | zur Konferenz)
Phrases are probabilistically re-ordered
See [Koehn et al, 2003] for an intro.
This is state-of-the-art!
Advantages of Phrase-Based
Many-to-many mappings can handle non-compositional phrases
Local context is very useful for disambiguating
“Interest rate” …
“Interest in” …
The more data, the longer the learned phrases
Sometimes whole sentences
How to Learn the Phrase Translation Table?
One method: “alignment templates” (Och et al, 1999)
Start with word alignment, build phrases from that.
Spanish: Maria no dió una bofetada a la bruja verde
English: Mary did not slap the green witch
This word-to-word alignment is a by-product of training a translation model like IBM-Model-3.
This is the best (or “Viterbi”) alignment.
IBM Models are 1-to-Many
Run IBM-style aligner both directions, then merge:
E->F best alignment + F->E best alignment -> MERGE (Union or Intersection)
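The merge step can be sketched with plain set operations. The two directional alignments below are hypothetical links for "Mary did not slap" / "Maria no dió una bofetada", written as (english_index, foreign_index) pairs; the function name and its mode argument are assumptions made for this example.

```python
def symmetrize(e2f, f2e, mode="intersection"):
    """Merge two directional alignments given as sets of (e_index, f_index) links."""
    if mode == "intersection":
        return e2f & f2e
    if mode == "union":
        return e2f | f2e
    raise ValueError("mode must be 'intersection' or 'union'")

# Hypothetical alignments, not taken from the lecture.
# In e2f every foreign word links to exactly one English word.
e2f = {(0, 0), (2, 1), (3, 2), (3, 3), (3, 4)}
# In f2e every English word links to exactly one foreign word.
f2e = {(0, 0), (1, 1), (2, 1), (3, 4)}

intersection = symmetrize(e2f, f2e, "intersection")
union = symmetrize(e2f, f2e, "union")
print(sorted(intersection))
print(sorted(union))
```

The intersection keeps only links both directions agree on, so it is sparse but reliable; the union keeps every proposed link, so it covers more words at the cost of more errors.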
How to Learn the Phrase Translation Table?
Collect all phrase pairs that are consistent with the word alignment
Spanish: Maria no dió una bofetada a la bruja verde
English: Mary did not slap the green witch
[Figure: word-alignment grid with one example phrase pair highlighted.]
Consistent with Word Alignment
[Figure: three candidate phrase pairs drawn on the alignment grid for “Maria no dió” / “Mary did not slap”: one is consistent, two are inconsistent (violating alignment points marked x).]
Phrase alignment must contain all alignment points for all
the words in both phrases!
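The consistency rule can be sketched as a check that no alignment link crosses the boundary of a candidate phrase pair, followed by a brute-force enumeration of all consistent pairs. The alignment below is an assumed one for a shortened version of the example sentence, and the exhaustive loops are chosen for clarity rather than efficiency.

```python
def consistent(alignment, e_span, f_span):
    """True if no link crosses the phrase-pair boundary and at least one lies inside.

    alignment: set of (e_index, f_index) links; spans are inclusive (start, end).
    """
    e_lo, e_hi = e_span
    f_lo, f_hi = f_span
    inside = False
    for e, f in alignment:
        e_in = e_lo <= e <= e_hi
        f_in = f_lo <= f <= f_hi
        if e_in != f_in:  # link has one end inside and one outside: inconsistent
            return False
        inside = inside or e_in
    return inside

def extract_phrases(e_words, f_words, alignment, max_len=5):
    """Collect every (foreign phrase, English phrase) pair consistent with the alignment."""
    pairs = []
    for e_lo in range(len(e_words)):
        for e_hi in range(e_lo, min(e_lo + max_len, len(e_words))):
            for f_lo in range(len(f_words)):
                for f_hi in range(f_lo, min(f_lo + max_len, len(f_words))):
                    if consistent(alignment, (e_lo, e_hi), (f_lo, f_hi)):
                        pairs.append((" ".join(f_words[f_lo:f_hi + 1]),
                                      " ".join(e_words[e_lo:e_hi + 1])))
    return pairs

e_words = "Mary did not slap".split()
f_words = "Maria no dió una bofetada".split()
# Assumed links (e_index, f_index): Mary-Maria, did-no, not-no, slap-dió/una/bofetada.
alignment = {(0, 0), (1, 1), (2, 1), (3, 2), (3, 3), (3, 4)}
pairs = extract_phrases(e_words, f_words, alignment)
for pair in pairs:
    print(pair)
```

With this alignment, ("no", "did not") is extracted while ("no", "did") is rejected, because "no" is also linked to "not", which lies outside the candidate English phrase.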
Word Alignment Induced Phrases
Spanish: Maria no dió una bofetada a la bruja verde
English: Mary did not slap the green witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap the)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)
(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)
(a la bruja verde, the green witch) …
(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
Phrase Pair Probabilities
A certain phrase pair (f-f-f, e-e-e) may appear many times across the
bilingual corpus.
We hope so!
So, now we have a vast list of phrase pairs and their frequencies – how to
assign probabilities?
Phrase Pair Probabilities
Basic idea:
No EM training
Just relative frequency:
P(f-f-f | e-e-e) = count(f-f-f, e-e-e) / count(e-e-e)
Important refinements:
Smooth using word probs P(f | e) for individual words connected in the word alignment
Some low count phrase pairs now have high probability, others have low probability
Discount for ambiguity
If phrase e-e-e can map to 5 different French phrases, due to the ambiguity of unaligned words, each pair gets a 1/5 count
Count BAD events too
If phrase e-e-e doesn’t map onto any contiguous French phrase, increment event count(BAD, e-e-e)
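The basic relative-frequency estimate can be sketched as follows. The list of extracted phrase pairs is hypothetical, and the sketch leaves out all three refinements (lexical smoothing, ambiguity discounting, and BAD events).

```python
from collections import Counter

def phrase_probs(extracted_pairs):
    """P(f | e) = count(f, e) / count(e) over a list of (f_phrase, e_phrase) pairs."""
    pair_counts = Counter(extracted_pairs)
    e_counts = Counter(e for _, e in extracted_pairs)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}

# Hypothetical extraction output pooled over a bilingual corpus.
extracted = [
    ("zur Konferenz", "to the conference"),
    ("zur Konferenz", "to the conference"),
    ("zur Konferenz", "to the conference"),
    ("zu der Konferenz", "to the conference"),
    ("zur Konferenz", "into the meeting"),
]
probs = phrase_probs(extracted)
for pair, p in sorted(probs.items()):
    print(pair, p)
```

Note that a phrase pair seen only once gets probability 1.0 when its English side was also seen only once, which is one reason the smoothing refinement matters for low-count pairs.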
Advanced Training Methods
Basic Model, Revisited
argmax_e P(e | f)
= argmax_e P(e) x P(f | e) / P(f)
= argmax_e P(e) x P(f | e)
Basic Model, Revisited
argmax_e P(e | f)
= argmax_e P(e) x P(f | e) / P(f)
argmax_e P(e)^2.4 x P(f | e) … works better!
Basic Model, Revisited
argmax_e P(e | f)
= argmax_e P(e) x P(f | e) / P(f)
argmax_e P(e)^2.4 x P(f | e) x length(e)^1.1
Rewards longer hypotheses, since these are unfairly punished by P(e)
Basic Model, Revisited
argmax_e P(e)^2.4 x P(f | e) x length(e)^1.1 x KS^3.7 …
Lots of knowledge sources vote on any given hypothesis.
“Knowledge source” = “feature function” = “score component”.
Feature function simply scores a hypothesis with a real value.
(May be binary, as in “e has a verb”).
Problem: How to set the exponent weights?
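The weighted product of knowledge sources is easier to work with as a weighted sum of logs. The sketch below uses the three exponents shown on the slides (2.4, 1.0, 1.1) as fixed weights; the hypotheses and their probabilities are hypothetical, and how to tune the weights is the open problem the slide poses, not something this example solves.

```python
import math

# Exponents from the slides, used here as fixed feature weights.
WEIGHTS = {"lm": 2.4, "tm": 1.0, "length": 1.1}

def score(hyp):
    """Weighted log score: 2.4*log P(e) + 1.0*log P(f | e) + 1.1*log length(e)."""
    return (WEIGHTS["lm"] * math.log(hyp["p_e"])
            + WEIGHTS["tm"] * math.log(hyp["p_f_given_e"])
            + WEIGHTS["length"] * math.log(len(hyp["e"].split())))

# Hypothetical hypotheses with made-up model probabilities.
hyps = [
    {"e": "I am so hungry", "p_e": 1e-4, "p_f_given_e": 1e-3},
    {"e": "hungry", "p_e": 1e-2, "p_f_given_e": 1e-9},
]
best = max(hyps, key=score)
print(best["e"], round(score(best), 3))
```

Each additional knowledge source becomes one more weighted term in the sum, so a binary feature such as "e has a verb" can be added without changing the structure of the scorer.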
MT Pyramid
[Figure: translation pyramid. The SOURCE and TARGET sides each rise through words, phrases, syntax, and semantics, meeting at interlingua at the apex.]
Why Syntax?
Need much more grammatical output
Need accurate control over re-ordering
Need accurate insertion of function words
Word translations need to depend on grammatically-related words
Yamada/Knight 01: Modeling and Training
Parse Tree(E): he adores listening to music
[Figure: the English parse tree passes through Reorder, Insert, and Translate operations; Take Leaves then reads the Japanese sentence off the transformed tree.]
Sentence(J): Kare ha ongaku wo kiku no ga daisuki desu
Japanese/English Reorder Table
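Original Order | Reordering | P(reorder|original)
PRP VB1 VB2 | PRP VB1 VB2 | 0.074
PRP VB1 VB2 | PRP VB2 VB1 | 0.723
PRP VB1 VB2 | VB1 PRP VB2 | 0.061
PRP VB1 VB2 | VB1 VB2 PRP | 0.037
PRP VB1 VB2 | VB2 PRP VB1 | 0.083
PRP VB1 VB2 | VB2 VB1 PRP | 0.021
VB TO | VB TO | 0.107
VB TO | TO VB | 0.893
TO NN | TO NN | 0.251
TO NN | NN TO | 0.749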
For French/English, useful parameters like P(N ADJ | ADJ N).
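A reorder table is used by multiplying one entry per tree node. The sketch below scores the three reorderings that turn English order into Japanese order for "he adores listening to music"; the pairing of probabilities with rows follows the order in which the slide lists them, which is an assumption, and the function name is hypothetical.

```python
# Entries transcribed in the slide's listing order (row assignment is an assumption).
R_TABLE = {
    ("PRP VB1 VB2", "PRP VB2 VB1"): 0.723,
    ("VB TO", "TO VB"): 0.893,
    ("TO NN", "NN TO"): 0.749,
}

def reorder_prob(operations):
    """Multiply P(reorder | original) over the reordering applied at each node."""
    p = 1.0
    for original, reordered in operations:
        p *= R_TABLE[(original, reordered)]
    return p

operations = [
    ("PRP VB1 VB2", "PRP VB2 VB1"),  # move the main verb after its complement
    ("VB TO", "TO VB"),              # verb-final order inside the complement
    ("TO NN", "NN TO"),              # postposition follows its noun
]
p = reorder_prob(operations)
print(round(p, 4))
```

Because the product is symmetric in the last two factors, the result does not depend on which of those two rows carries 0.893 and which carries 0.749.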
Casting Syntax MT Models As Tree Transducer Automata
[Graehl & Knight 04]
[Figure: four example tree transducer rules, one per panel.]
Non-local Re-Ordering (English/Arabic)
Non-constituent Phrasal Translation (English/Spanish): “there are two men” -> “hay dos hombres”
Lexicalized Re-Ordering (English/Chinese)
Long-distance Re-Ordering (English/Japanese)
Summary
Phrase-based models are state-of-the-art
Word alignments
Phrase pair extraction & probabilities
N-gram language models
Beam search decoding
Feature functions & learning weights
But the output is not English
Fluency must be improved
Better translation of person names, organizations, locations
More automatic acquisition of parallel data, exploitation of monolingual data across a variety of domains/languages
Need good accuracy across a variety of domains/languages
Available Resources
Bilingual corpora
100m+ words of Chinese/English and Arabic/English, LDC (www.ldc.upenn.edu)
Lots of French/English, Spanish/French/English, LDC
European Parliament (sentence-aligned), 11 languages, Philipp Koehn, ISI (www.isi.edu/~koehn/publications/europarl)
20m words (sentence-aligned) of English/French, Ulrich Germann, ISI (www.isi.edu/natural-language/download/hansard/)
Sentence alignment
Dan Melamed, NYU (www.cs.nyu.edu/~melamed/GMA/docs/README.htm)
Xiaoyi Ma, LDC (Champollion)
Word alignment
GIZA, JHU Workshop ’99 (www.clsp.jhu.edu/ws99/projects/mt/)
GIZA++, RWTH Aachen (www-i6.Informatik.RWTH-Aachen.de/web/Software/GIZA++.html)
Manually word-aligned test corpus (500 French/English sentence pairs), RWTH Aachen
Shared task, NAACL-HLT’03 workshop
Decoding
ISI ReWrite Model 4 decoder (www.isi.edu/licensed-sw/rewrite-decoder/)
ISI Pharaoh phrase-based decoder
Statistical MT Tutorial Workbook, ISI (www.isi.edu/~knight/)
Annual common-data evaluation, NIST (www.nist.gov/speech/tests/mt/index.htm)
Some Papers Referenced on Slides
ACL
[Och, Tillmann, & Ney, 1999]
[Och & Ney, 2000]
[Germann et al, 2001]
[Yamada & Knight, 2001, 2002]
[Papineni et al, 2002]
[Alshawi et al, 1998]
[Collins, 1997]
[Koehn & Knight, 2003]
[Al-Onaizan & Knight, 2002]
[Och & Ney, 2002]
[Och, 2003]
[Koehn et al, 2003]
[Soricut et al, 2002]
[Al-Onaizan & Knight, 1998]
[Marcu & Wong, 2002]
[Fox, 2002]
[Munteanu & Marcu, 2002]
AI Magazine
www.isi.edu/~knight
[Knight, 1997]
EACL
[Cmejrek et al, 2003]
Computational Linguistics
[Brown et al, 1993]
[Knight, 1999]
[Wu, 1997]
EMNLP
AMTA
AAAI
[Koehn & Knight, 2000]
IWNLG
[Habash, 2002]
[MT Tutorial Workbook]
MT Summit
[Charniak, Knight, Yamada, 2003]
NAACL
[Koehn, Marcu, Och, 2003]
[Germann, 2003]
[Graehl & Knight, 2004]
[Galley, Hopkins, Knight, Marcu, 2004]
Terminology
Simple Bayes, aka Naïve Bayes
Zero counts: case where an attribute value never occurs with a label in D
No match approach: assign a c/m probability to P(xik | vj)
m-estimate aka Laplace approach: assign a Bayesian estimate to P(xik | vj)
Learning in Natural Language Processing (NLP)
Training data: text corpora (collections of representative documents)
Statistical Queries (SQ) oracle: answers queries about P(xik, vj) for x ~ D
Linear Statistical Queries (LSQ) algorithm: classification using f(oracle response)
• Includes: Naïve Bayes, BOC
• Other examples: Hidden Markov Models (HMMs), maximum entropy
Problems: word sense disambiguation, part-of-speech tagging
Applications
• Spelling correction, conversational agents
• Information retrieval: web and digital library searches
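The zero-count problem and the m-estimate named above can be illustrated with a short sketch. It uses the m-estimate formula (n_c + m*p) / (n + m) from Mitchell, Section 6.9; the counts, prior, and equivalent sample size in the example are made up.

```python
def m_estimate(n_c, n, p, m):
    """Bayesian estimate of P(x_ik | v_j).

    n_c: times the attribute value occurred with label v_j
    n:   times label v_j occurred
    p:   prior estimate of the probability
    m:   equivalent sample size (weight given to the prior)
    """
    return (n_c + m * p) / (n + m)

# Zero count: a word never seen with this label in 10 training examples.
# With a uniform prior over 5 values (p = 0.2) and m = 5 the estimate stays nonzero.
smoothed = m_estimate(n_c=0, n=10, p=0.2, m=5)
# With m = 0 the estimate reduces to the raw relative frequency n_c / n.
raw = m_estimate(n_c=3, n=10, p=0.5, m=0)
print(smoothed, raw)
```

Choosing p = 1/k and m = k for an attribute with k possible values gives add-one (Laplace) smoothing as a special case.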
Summary Points
More on Simple Bayes, aka Naïve Bayes
More examples
Classification: choosing between two classes; general case
Robust estimation of probabilities: SQ
Learning in Natural Language Processing (NLP)
Learning over text: problem definitions
Statistical Queries (SQ) / Linear Statistical Queries (LSQ) framework
• Oracle
• Algorithms: search for h using only (L)SQs
Bayesian approaches to NLP
• Issues: word sense disambiguation, part-of-speech tagging
• Applications: spelling; reading/posting news; web search, IR, digital libraries
Next Week: Section 6.11, Mitchell; Pearl and Verma
Read: Charniak tutorial, “Bayesian Networks without Tears”
Skim: Chapter 15, Russell and Norvig; Heckerman slides