Bilingual Alignment Models:
Cognates and Phrases
Andrea Burbank
Dinkar Gupta
Spring 2006
French/English word alignments
• Cognates
• French and English share many words with common roots
• Can identifying cognate pairs improve alignments?
• What about a distribution based on word lengths?
• Phrases capture conceptual mappings
• Overcome language-specific syntax and constructs
• Examples: “pommes frites” → “French fries”, “à demain” → “see you tomorrow”, “ne veux jamais” → “never wants”
• Aligned phrases need not be long (3 or 4 words)
• Concepts are arranged in the same order in French and English
Cognate Identification
• Identifying clear cognate matches can help create benchmarks for alignment
• Different cognate matching metrics:
• match the first four letters (e.g. suggère, suggests)
• count shared bigrams (Dice coefficient)
• e.g. unité, unity: shared bigrams un, ni, it, so Dice = 2·3/(4+4) = 3/4
• longest common subsequence ratio (LCSR)
• e.g. couleur, color: LCS = c-o-l-r, so LCSR = 4/max(l_e, l_f) = 4/7
• count shared letters and normalize by length
• e.g. chat, cat = ((cat/chat + cat/cat) / 2) × e^(-(4-3)) = ((3/4 + 3/3) / 2) × e^(-(4-3))
• Incorporating cognates: add pairs to training set
• Word-length distributions: EM algorithm
• P(l_f | l_e) iteratively calculated in Model 1
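The first three matching metrics listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names (`prefix_match`, `dice`, `lcsr`) and the choice to compare raw, accented characters without any normalization are assumptions made for the example.

```python
from collections import Counter


def prefix_match(f, e, n=4):
    """True when both words are at least n letters long and share their first n letters."""
    return len(f) >= n and len(e) >= n and f[:n] == e[:n]


def bigrams(word):
    """Adjacent letter pairs, e.g. 'unity' -> ['un', 'ni', 'it', 'ty']."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice(f, e):
    """Dice coefficient over letter bigrams: 2 * shared / (bigrams in f + bigrams in e)."""
    bf, be = Counter(bigrams(f)), Counter(bigrams(e))
    total = sum(bf.values()) + sum(be.values())
    if total == 0:
        return 0.0
    shared = sum((bf & be).values())  # multiset intersection
    return 2 * shared / total


def lcs_length(a, b):
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcsr(f, e):
    """Longest common subsequence ratio: LCS length divided by the longer word's length."""
    longest = max(len(f), len(e))
    return lcs_length(f, e) / longest if longest else 0.0
```

On the slide's own examples, `prefix_match("suggère", "suggests")` holds, `dice("unité", "unity")` gives 3/4, and `lcsr("couleur", "color")` gives 4/7. A cutoff for deciding when a score counts as a cognate match is not given on the slides and would have to be chosen separately.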
Phrase mappings
• Good Mappings
• “la Bourse the Toronto” → “The Toronto Stock Exchange”
• “les actes criminels” → “crimes of violence”
• “excusez-nous si” → “excuse us if we”
• “serait que” → “that it would”
• “profiter de le occasion” → “take this opportunity to”
• Extraneous
• “la vision que” → “vision that”
• Bad Mappings
• “les pays” → “the country to”
• “le gouffre financière” → “cheered on by”
• Good: “pouvons travailler”
• “can work”
• “can work together”
• “can work together within”
• “can all work”
• “can all work together”
• “we can all work”
• Bad: “les administrations”
• “of the GDP over”
• “percent of the GDP over”
• “4.22 percent of”
• “to 4.22 percent of”
• “of the GDP”
• “4.22 percent of the”
Results: significant improvements!
[Figure: Model 1 trained on the test set, with phrases]
[Figure: Model 1 with and without cognates, words only]