Adapting EBMT to Chinese - Carnegie Mellon University

Adapting EBMT to Chinese
Joy (Ying Zhang)
[email protected]
Jan 26, 2001
17 July 2015
Adapting EBMT to Chinese
[email protected]
1
Topics
– Project overview
– EBMT outline
– Chinese language
– Improved Segmenter
– English phrase recognition and bracketing
– Statistical dictionary
– Results
– Ongoing and future work
Project Overview
Part of Lingwear, TIDES
Adapting the existing multi-engine Pangloss MT system to Chinese-English
Quick-deploy MT system: develop MT with the smallest amount of human effort and knowledge
Multi-engine MT system
There are three translation engines in the current system:
– EBMT: Example-Based Machine Translation
– DICTionary: provides coverage for words not otherwise covered by EBMT; it can be constructed automatically from a bilingual corpus
– GLOSSaries: built from hand-crafted bilingual word/phrase glossaries
EBMT outline
Concepts
– An Example-Based Machine Translation (EBMT) system is given a set of sentences in the source language (from which one is translating) and their corresponding translations in the target language, and uses those examples to translate other, similar source-language sentences into the target language.
– The basic premise is that, if a previously translated sentence occurs again, the same translation is likely to be correct again. (Ralf Brown)
– Other EBMT systems operate on parse trees, or find the most similar complete sentence and modify its translation based on the differences between the sentence to be translated and the matched example. (Ralf Brown)
Our system is a shallow EBMT system
Bilingual corpus
Indexing (using dictionary), then matching
One of the most important issues: improving the performance of MATCHING
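The matching idea can be illustrated with a minimal Python sketch. It is not the Pangloss EBMT implementation: the function names, the toy corpus and the brute-force search are all assumptions made for illustration. For each position in the input sentence, it looks for the longest run of words that also occurs contiguously in the source side of some example.

```python
from typing import List, Tuple

# Hypothetical representation: an example is (source words, target words).
Example = Tuple[List[str], List[str]]


def contains(seq: List[str], sub: List[str]) -> bool:
    """True if `sub` occurs as a contiguous run inside `seq`."""
    n = len(sub)
    return any(seq[i:i + n] == sub for i in range(len(seq) - n + 1))


def longest_matches(sentence: List[str], corpus: List[Example]):
    """For each start position, return the longest run found in the corpus.

    Each result is (start, end, indices of the examples containing the run).
    Positions with no match at all are skipped.
    """
    matches = []
    for start in range(len(sentence)):
        # Try the longest span first and stop at the first one that matches.
        for end in range(len(sentence), start, -1):
            chunk = sentence[start:end]
            hits = [i for i, (src, _) in enumerate(corpus) if contains(src, chunk)]
            if hits:
                matches.append((start, end, hits))
                break
    return matches


corpus = [
    (["我", "喜欢", "苹果"], ["I", "like", "apples"]),
    (["他", "喜欢", "音乐"], ["he", "likes", "music"]),
]
print(longest_matches(["我", "喜欢", "音乐"], corpus))
```

Longer source-side units make each match carry more context, which is why the later slides concentrate on segmentation and bracketing.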
Chinese language
Character
– The unit from which words are built; almost every character has a meaning of its own. When characters are combined into a word, the meaning of the word may differ from the meanings of its characters.
Word
– Usually a bigram (two-character word); unigrams, trigrams and 4-grams also occur, and n-grams with n>4 are specific idioms (data from FDMC 1986):

Word length   Percentage
Unigram       26.7%
Bigram        69.8%
Trigram       2.7%
4-gram        0.007%
5-gram        0.0002%
Chinese language (cont.)
Problems with words
– Vague definition of words
• E.g. People’s Republic of China (all of these words can be considered legal words)
Chinese language (cont.)
– Unknown words
• New words
• Words specific to a certain domain, e.g. legal code
Chinese language (cont.)
Segmentation
– Segmenting words from the sequence of characters
– LDC segmenter: uses a dynamic programming algorithm and depends on a frequency dictionary
Problem with LDC segmentation
– The frequency dictionary cannot cover the corpus, which leads to mis-segmentation
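A minimal Python sketch of a frequency-dictionary segmenter based on dynamic programming shows where mis-segmentation comes from. This is not the LDC segmenter's code: the scoring (sum of log relative frequencies), the `max_len` limit and the fallback count for unknown single characters are assumptions chosen for illustration.

```python
import math
from typing import Dict, List


def segment(text: str, freq: Dict[str, int], max_len: int = 4) -> List[str]:
    """Split `text` into the word sequence with the highest total log-frequency."""
    total = sum(freq.values())
    # best[i] = (score of the best segmentation of text[:i], start of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq:
                count = freq[word]
            elif len(word) == 1:
                count = 0.5  # assumed fallback so unknown characters stay segmentable
            else:
                continue  # multi-character strings must be in the dictionary
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


freq = {"中国": 50, "人民": 40, "中": 10, "国": 10, "人": 10, "民": 10}
print(segment("中国人民", freq))  # both words are in the dictionary
print(segment("中国法典", freq))  # "法典" is missing, so it falls apart into characters
```

In the second call the domain term is absent from the dictionary, so it is split into single characters; this is the coverage problem described above.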
Chinese language (cont.)
Consequence of mis-segmentation
– Match??
– The longer the word, the better the coverage for EBMT (the word encapsulates its context)
Improved Segmenter
Basic idea: use statistical lexical acquisition to augment the frequency dictionary for the segmenter
Steps:
– Use a sliding window to extract repeating patterns (sequences of characters) from the corpus
– Refine the patterns to construct longer words/terms
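The two steps can be sketched in Python as follows. The window sizes, the frequency threshold and the refinement rule (drop a pattern when a longer pattern containing it is equally frequent) are assumptions for illustration; the slides do not specify how the refinement is actually done.

```python
from collections import Counter
from typing import Dict


def repeated_patterns(text: str, min_len: int = 2, max_len: int = 4,
                      min_count: int = 2) -> Dict[str, int]:
    """Slide windows of min_len..max_len characters and keep repeated strings."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


def keep_longest(patterns: Dict[str, int]) -> Dict[str, int]:
    """Assumed refinement: drop patterns subsumed by an equally frequent longer one."""
    return {
        p: c for p, c in patterns.items()
        if not any(p != q and p in q and c == patterns[q] for q in patterns)
    }


text = "知识产权很重要知识产权法"
print(keep_longest(repeated_patterns(text)))
```

The surviving patterns are candidate entries for the segmenter's frequency dictionary.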
Improved Segmenter (cont.)
Assumptions:
1. Localization: words of the same type appear more frequently near each other, rather than being distributed evenly over the whole corpus
Improved Segmenter
Assumption:
2. If a pattern is going to appear again, the next occurrence should fall within a range related to the average distance between its previous occurrences
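One way to read this assumption is sketched below in Python. The helper names, the `factor` parameter and the way the range is derived from the average gap are hypothetical; the slides do not say how the range is actually computed.

```python
from typing import List, Optional, Tuple


def occurrence_positions(text: str, pattern: str) -> List[int]:
    """Start offsets of every occurrence of `pattern` in `text`."""
    positions, start = [], text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def expected_window(positions: List[int], factor: float = 2.0) -> Optional[Tuple[int, int]]:
    """Assumed rule: the next occurrence should start within
    `factor` times the average gap after the last occurrence."""
    if len(positions) < 2:
        return None  # no gap to average over yet
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    average_gap = sum(gaps) / len(gaps)
    last = positions[-1]
    return (last + 1, last + int(factor * average_gap))


positions = occurrence_positions("abXXcdXXefgXX", "XX")
print(positions, expected_window(positions))
```

A pattern whose occurrences stay inside such windows is behaving as the localization assumption predicts.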
Improved Segmenter
Results:
– Hard to evaluate, because of the vague definition of words
– The effect of the improved segmenter can be seen in the improvement of EBMT coverage
English phrase bracketing
Match:
– As we increased the average length of Chinese words, we did a similar thing for English so that the Chinese and English parts of the corpus still match
– Recognize English phrases and bracket them in the corpus (replacing the blanks with underscores), e.g. the_people’s_republic_of_china (it will be treated as one word)
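The bracketing step itself can be sketched in Python. The phrase list is assumed to be given already (the slides do not describe the recognizer), and the greedy longest-match strategy is an assumption for illustration.

```python
from typing import List, Set, Tuple


def bracket(tokens: List[str], phrases: Set[Tuple[str, ...]]) -> List[str]:
    """Join every known phrase into one underscore-connected token."""
    max_len = max((len(p) for p in phrases), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first; a single token always succeeds.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if n == 1 or candidate in phrases:
                out.append("_".join(candidate))
                i += n
                break
    return out


phrases = {("the", "people's", "republic", "of", "china")}
tokens = "visited the people's republic of china today".split()
print(bracket(tokens, phrases))
```

After bracketing, the phrase is a single token and can be aligned against a single long Chinese word.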
Statistical dictionary
Step 1: collapse the inflected forms of English phrases/words into one class
– Algorithm: the longest common substring of two phrases should be long enough
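A minimal Python sketch of the longest-common-substring test follows. The slides only say the substring should be "long enough"; the `min_ratio` threshold and the normalization by the longer string are assumptions for illustration.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Classic dynamic-programming longest common (contiguous) substring."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the match ending at i-1, j-1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def same_class(a: str, b: str, min_ratio: float = 0.75) -> bool:
    """Assumed criterion: the shared substring covers most of the longer form."""
    if not a or not b:
        return False
    common = len(longest_common_substring(a, b))
    return common / max(len(a), len(b)) >= min_ratio


print(same_class("translate", "translated"))
print(same_class("translate", "transport"))
```

Forms that pass the test are collapsed into one class before the dictionary statistics are collected.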
Statistical dictionary
Step 2: build the statistical dictionary
– Algorithm (with help from Benjamin)
S: source language word
T: target language word
F(S, T) = a [?] P(S|T) [?] P(T|S) [?][?] b [?] ( min(P(S|T), P(T|S)) / max(P(S|T), P(T|S)) )
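The score F(S, T) combines the two conditional probabilities, weighted by a, with the ratio min/max, weighted by b. The Python sketch below shows one possible reading, under two loudly stated assumptions: the probabilities are estimated from sentence-level co-occurrence counts, and the terms are combined as a * P(S|T) * P(T|S) + b * min/max. Neither the estimation method nor the operators are confirmed by the slides.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]


def cooccurrence_scores(pairs: List[SentencePair], a: float = 1.0,
                        b: float = 1.0) -> Dict[Tuple[str, str], float]:
    """Score every (source word, target word) pair seen in the same sentence pair."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src_words, tgt_words in pairs:
        src_set, tgt_set = set(src_words), set(tgt_words)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        joint.update(product(src_set, tgt_set))
    scores = {}
    for (s, t), n in joint.items():
        p_s_given_t = n / tgt_count[t]  # assumed estimate of P(S|T)
        p_t_given_s = n / src_count[s]  # assumed estimate of P(T|S)
        ratio = min(p_s_given_t, p_t_given_s) / max(p_s_given_t, p_t_given_s)
        # Assumed combination of the terms; the operators are not in the source.
        scores[(s, t)] = a * p_s_given_t * p_t_given_s + b * ratio
    return scores


pairs = [
    (["我", "喜欢", "苹果"], ["i", "like", "apples"]),
    (["我", "喜欢", "音乐"], ["i", "like", "music"]),
]
scores = cooccurrence_scores(pairs)
print(scores[("苹果", "apples")], scores[("苹果", "i")])
```

Under this reading, the min/max ratio rewards pairs whose two conditional probabilities agree, so a word that co-occurs with everything scores lower than a one-to-one match.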
Statistical dictionary
Iteration
– Because the improved segmenter and the phrase extraction both work monolingually, an extracted Chinese term may not be found with a translation
– Use only the Chinese words and English phrases that are found with a translation to re-segment/re-bracket the corpus
– Build the statistical dictionary again
– Repeat this loop several times; the size of the statistical dictionary increases
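The filtering step inside this loop can be sketched in Python. The dictionary layout (a mapping from (source, target) pairs to scores) and the `min_score` cutoff are hypothetical; the sketch only shows how candidate terms without a translation would be dropped before re-segmenting.

```python
from typing import Dict, Iterable, List, Tuple


def translated_terms(candidates: Iterable[str],
                     dictionary: Dict[Tuple[str, str], float],
                     min_score: float = 0.0) -> List[str]:
    """Keep candidate terms that have at least one entry scoring above min_score."""
    have_translation = {s for (s, _t), score in dictionary.items() if score > min_score}
    return [c for c in candidates if c in have_translation]


# Hypothetical dictionary entries and candidate terms.
dictionary = {("知识产权", "intellectual_property"): 1.5, ("保护", "protect"): 0.8}
print(translated_terms(["知识产权", "权很", "保护"], dictionary))
```

The surviving terms would then feed the next round of segmentation, bracketing and dictionary building.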
Results
Exp16: Baseline system
Exp15: Base system + improved segmenter
Exp18: Base system + improved segmenter + StatDict
Exp14: Base system + improved segmenter + bracketer + statistical dictionary (3 iterations)
Ongoing and future work
Feedback from the statistical dictionary to the segmenter and bracketer
Topic detection, corpus clustering
Related work ongoing:
– Ralf: Generalization, word clustering
– Erik: Relative clause detection and reordering