Adapting EBMT to Chinese - Carnegie Mellon University
Adapting EBMT to Chinese
Joy (Ying Zhang)
[email protected]
Jan 26, 2001
17 July 2015
Adapting EBMT to Chinese
[email protected]
1
Topics
Project overview
EBMT outline
Chinese language
Improved Segmenter
English phrase recognizing and bracketing
Statistical dictionary
Results
Ongoing and future work
Project Overview
Part of Lingwear, TIDES
Adapting the existing multi-engine Pangloss MT system to Chinese-English
Quick-deploy MT system: develop MT with the smallest amount of human effort and knowledge
Multi-engine MT system
There are three translation engines in the current system:
– EBMT: Example-Based Machine Translation
– DICTionary: provides coverage for words not otherwise covered by EBMT; it can be constructed automatically from a bilingual corpus
– GLOSSaries: built from hand-crafted bilingual word/phrase glossaries
EBMT outline
Concepts
– An Example-Based Machine Translation (EBMT) system is given a set of sentences in the source language (from which one is translating) and their corresponding translations in the target language, and uses those examples to translate other, similar source-language sentences into the target language.
– The basic premise is that, if a previously translated sentence occurs again, the same translation is likely to be correct again. (Ralf Brown)
– Other EBMT systems operate on parse trees, or find the most similar complete sentence and modify its translation based on the differences between the sentence to be translated and the matched example. (Ralf Brown)
Our system is a shallow EBMT system
Bilingual corpus
Indexing (using dictionary)---Matching
One of the most important issues: improving the performance of MATCHING
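A minimal sketch of what indexing and matching can look like in a shallow EBMT setting. It is an illustration only, not the Pangloss EBMT engine: the function names `build_index` and `longest_matches` are hypothetical, the corpus is assumed to be a list of (source words, target words) pairs, and the lookup of the corresponding target fragment is left out.

```python
from collections import defaultdict

def build_index(corpus):
    """Map each source word to the (sentence_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for sent_id, (source, _target) in enumerate(corpus):
        for pos, word in enumerate(source):
            index[word].append((sent_id, pos))
    return index

def longest_matches(corpus, index, query):
    """For each start position in the query, find the longest run of words
    that also occurs contiguously in some example source sentence.
    Returns (start, length, sentence_id) triples."""
    matches = []
    for start, word in enumerate(query):
        best_len, best_sent = 0, None
        for sent_id, pos in index.get(word, []):
            source = corpus[sent_id][0]
            length = 0
            while (start + length < len(query)
                   and pos + length < len(source)
                   and query[start + length] == source[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_sent = length, sent_id
        if best_len > 0:
            matches.append((start, best_len, best_sent))
    return matches
```

Longer matched runs give the engine more context to translate from, which is why word segmentation quality matters for matching.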
Chinese language
Character
– The unit from which words are built; almost every character has a meaning of its own. When characters are combined to form a word, the meaning of the word may differ from the meanings of its characters
Word
– Usually a bigram (two-character word); unigrams, trigrams and 4-grams also occur, and n-grams with n>4 are specific idioms (data from FDMC 1986):
  Word length   Proportion
  Unigram       26.7%
  Bigram        69.8%
  Trigram       2.7%
  4-gram        0.007%
  5-gram        0.0002%
Chinese language (cont.)
Problems with words
– Vague definition of words
• E.g. "People’s Republic of China": the whole term and each of its component words can all be considered legitimate words
Chinese language (cont.)
– Unknown words
• New words
• Words specific to a certain domain, e.g. legal code
Chinese language (cont.)
Segmentation
– Segmenting words from the sequence of characters
– The LDC segmenter uses a dynamic programming algorithm and depends on a frequency dictionary
Problem of LDC segmentation
– The frequency dictionary cannot cover the corpus, which leads to mis-segmentation
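To make the dependence on the frequency dictionary concrete, here is a small dictionary-driven segmenter written with dynamic programming. It is a sketch under assumptions, not the LDC segmenter's code: the function name `segment`, the `max_len` limit, and the fixed penalty for characters missing from the dictionary are all illustrative choices.

```python
import math

def segment(text, freq, max_len=4):
    """Split text into the word sequence with the highest product of
    relative frequencies, using dynamic programming over end positions."""
    total = sum(freq.values())
    # best[i] = (log-probability, start index of the last word) for text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq:
                logp = math.log(freq[word] / total)
            elif len(word) == 1:
                # Assumed penalty so that unknown characters stay segmentable.
                logp = math.log(0.5 / total)
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

A term that is absent from `freq` can only come out as shorter pieces or single characters, which is the mis-segmentation problem described above.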
Chinese language (cont.)
Consequence of mis-segmentation
– Match??
– The longer the word, the better the coverage for EBMT (the context is encapsulated in the word)
Improved Segmenter
Basic idea: use statistical lexical acquisition to augment the frequency dictionary for the segmenter
Steps:
– Use a sliding window to extract repeating patterns (sequences of characters) from the corpus
– Refine the patterns to construct longer words/terms
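The two steps above can be sketched as follows. This is one simple reading of them, not the actual implementation: `repeated_patterns` and `refine` are hypothetical names, the window sizes and count threshold are arbitrary, and the refinement rule (drop a pattern when a longer pattern containing it occurs equally often) is an assumption.

```python
from collections import Counter

def repeated_patterns(text, min_len=2, max_len=4, min_count=2):
    """Slide windows of each length over the text and keep the character
    sequences that occur at least min_count times."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {p: c for p, c in counts.items() if c >= min_count}

def refine(patterns):
    """Keep the longer pattern when a shorter one only ever occurs inside it
    (approximated here by the two having identical counts)."""
    kept = {}
    for p, c in patterns.items():
        subsumed = any(p != q and p in q and c == cq
                       for q, cq in patterns.items())
        if not subsumed:
            kept[p] = c
    return kept
```

The surviving patterns would then be added to the segmenter's frequency dictionary as candidate words.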
Improved Segmenter (cont.)
Assumptions:
1. Localization: words of the same type tend to appear near each other rather than being distributed evenly across the whole corpus
Improved Segmenter (cont.)
Assumptions (cont.):
2. If a pattern is going to appear again, it should appear within a range related to the average distance between its previous occurrences
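One way to read this assumption in code: record where a pattern has occurred, average the gaps, and only look for the next occurrence inside a window proportional to that average. The names `occurrence_positions` and `expected_window`, and the multiplier `factor`, are hypothetical; the slides do not say how the range is actually derived.

```python
def occurrence_positions(text, pattern):
    """Return the start offsets of every (possibly overlapping) occurrence."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions

def expected_window(positions, factor=2.0):
    """Return (low, high) offsets within which the next occurrence is
    expected, or None when fewer than two occurrences have been seen."""
    if len(positions) < 2:
        return None
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    average_gap = sum(gaps) / len(gaps)
    last = positions[-1]
    return (last + 1, last + factor * average_gap)
```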
Improved Segmenter
Results:
– Hard to evaluate, because of the vague definition of words
– The effect of the improved segmenter can be seen in the improvement of EBMT coverage
English phrase bracketing
Match:
– Because the improved segmenter increases the average length of Chinese words, we did the same for English so that the Chinese and English sides of the corpus still match
– Recognize English phrases and bracket them in the corpus by replacing the blanks with underscores, e.g. the_people’s_republic_of_china (which is then treated as a single word)
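A small sketch of the bracketing step, assuming the list of recognized phrases is already available. The function name `bracket`, the greedy longest-match strategy, and the `max_len` limit are illustrative assumptions; how the phrases are recognized in the first place is not shown.

```python
def bracket(tokens, phrases, max_len=6):
    """Join every known multi-word phrase into one underscore-connected
    token, preferring the longest phrase that starts at each position."""
    phrase_set = {tuple(p.split()) for p in phrases}
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrase_set:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the single token unchanged.
            out.append(tokens[i])
            i += 1
    return out
```

For example, `bracket("the people's republic of china was founded".split(), ["the people's republic of china"])` joins the first five tokens into one and leaves the rest untouched.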
Statistical dictionary
Step 1: collapse the inflected forms of an English phrase/word into one class
– Algorithm: two phrases are put in the same class if their longest common substring is long enough.
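The longest-common-substring test can be sketched with Python's standard `difflib`. What counts as "long enough" is not stated in the slides, so the ratio threshold `min_ratio` below is an assumption, as are the names `same_class` and `collapse` and the greedy grouping against the first member of each class.

```python
from difflib import SequenceMatcher

def same_class(a, b, min_ratio=0.7):
    """True when the longest common substring covers at least min_ratio
    of the longer of the two strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / max(len(a), len(b)) >= min_ratio

def collapse(words, min_ratio=0.7):
    """Greedily group words: each word joins the first class whose
    representative (its first member) it matches, else starts a new class."""
    classes = []
    for word in words:
        for cls in classes:
            if same_class(word, cls[0], min_ratio):
                cls.append(word)
                break
        else:
            classes.append([word])
    return classes
```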
Statistical dictionary
Step 2: building the statistical dictionary
– Algorithm (with help from Benjamin)
S: source language word
T: target language word
$$F(S,T) = a \, P(S \mid T) \, P(T \mid S) + b \left( \frac{\min(P(S \mid T),\, P(T \mid S))}{\max(P(S \mid T),\, P(T \mid S))} \right)$$
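A sketch of how such a score could be computed from a sentence-aligned corpus. It assumes that F(S,T) is the weighted sum of the product P(S|T)·P(T|S) and the min/max ratio of the two conditional probabilities, that a and b are weighting constants, and that the probabilities are estimated from sentence-level co-occurrence counts; none of these details are confirmed by the slides, and `score_pairs` is a hypothetical name.

```python
from collections import Counter

def score_pairs(bitext, a=1.0, b=1.0):
    """Score every co-occurring (source word, target word) pair.
    bitext is a list of (source_words, target_words) sentence pairs."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_words, tgt_words in bitext:
        src_set, tgt_set = set(src_words), set(tgt_words)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        pair_count.update((s, t) for s in src_set for t in tgt_set)
    scores = {}
    for (s, t), n in pair_count.items():
        p_s_given_t = n / tgt_count[t]  # share of T's sentences that contain S
        p_t_given_s = n / src_count[s]  # share of S's sentences that contain T
        low, high = sorted((p_s_given_t, p_t_given_s))
        scores[(s, t)] = a * p_s_given_t * p_t_given_s + b * low / high
    return scores
```

The min/max term rewards pairs whose two conditional probabilities are balanced, so a frequent function word that co-occurs with everything scores lower than a pair that mostly appears together.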
Statistical dictionary
Iteration
– The improved segmenter and the phrase extraction both work monolingually, so an extracted Chinese term may have no translation in the dictionary
– Use only the Chinese words and English phrases for which a translation was found to re-segment/re-bracket the corpus
– Build the statistical dictionary again
– Repeat this loop several times; the size of the statistical dictionary increased
Results
[Chart not recoverable; the experiments compared were:]
Exp16: Baseline system
Exp15: Baseline system + improved segmenter
Exp18: Baseline system + improved segmenter + statistical dictionary
Exp14: Baseline system + improved segmenter + bracketer + statistical dictionary (3 iterations)
Ongoing and future work
Feedback from the statistical dictionary to the segmenter and bracketer
Topic detection, corpus clustering
Related work ongoing:
– Ralf: Generalization, word clustering
– Erik: Relative clause detection and reordering