Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Model
Chun-Jen Lee
Jason S. Chang
Thomas C. Chuang
AMTA 2004
Introduction
Focus: extracting named entities (PER, LOC, ORG) from a bilingual corpus.
The feasibility of extracting interlingual NEs has seldom been addressed. Related work:
– Al-Onaizan and Knight 2002
– Huang and Vogel 2002
– Chen et al. 2003
– Moore 2003
– Kumano et al. 2004
– Lee et al. 2003 (baseline model)
This work integrates approximate matching and personal name recognition into the baseline model.
Framework
1. Preprocess:
   1) Perform sentence alignment.
   2) Label English named entities.
2. Main process:
   1. For each labeled NE_E, apply the Statistical Probability Translation Model (SPTM) and Approximate Matching to find Chinese named-entity candidates {NE_A} in S_C.
   2. For any word W_E in NE_E whose Chinese translation cannot be found in S_C, apply the proposed Statistical Transliteration Model, enhanced with Chinese Personal Name Recognition, to extract the corresponding Chinese transliterations {NE_B} in S_C with scores above a predefined threshold.
   3. Merge {NE_A} with {NE_B} into possible candidates {NE_C}.
   4. Rank {NE_C} by the cost scores. The candidate with the maximum score is chosen as the answer.
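The merge and rank steps at the end of the main process can be sketched as follows. This is only a minimal illustration: the candidate strings, the scores, the function name, and the rule of keeping the higher score when both models propose the same candidate are assumptions, not details given on the slide.

```python
def merge_and_rank(cands_a, cands_b):
    """Merge two candidate sets and return them sorted by score, best first.

    cands_a, cands_b: dicts mapping a Chinese candidate string to its score
    (higher is better). Keeping the larger score for a candidate proposed
    by both models is an assumption made for this sketch.
    """
    merged = dict(cands_a)
    for cand, score in cands_b.items():
        if cand not in merged or score > merged[cand]:
            merged[cand] = score
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical log-probability scores for one English NE.
cands_a = {"魚鱗癬 協會": -4.2, "關懷 魚鱗癬 協會": -2.9}
cands_b = {"關懷 魚鱗癬 協會": -3.5}
ranked = merge_and_rank(cands_a, cands_b)
best_candidate, best_score = ranked[0]
```

Taking the first element of the ranked list corresponds to choosing the candidate with the maximum score.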
SPTM
A noisy channel approach
An English phrase e with l words is translated into a Mandarin Chinese phrase f with m words by decomposing the channel function into two independent probabilistic functions:
– Lexical translation probability function P(f_{a_i} | e_i), where e_i is the i-th word in e and e_i is aligned with f_{a_i} in f under the alignment a
– Alignment probability function P(a | l, m) = P(a_1, a_2, …, a_l | l, m)
SPTM
E = “Ichthyosis Concern Association”
F = “關懷 魚鱗癬 協會”
Correct alignment: (a_1 = 2, a_2 = 1, a_3 = 3), i.e. Ichthyosis ↔ 魚鱗癬, Concern ↔ 關懷, Association ↔ 協會.
The phrase translation probability is
Defining the scoring function as a log probability function:
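For this example, the decomposition into a lexical term and an alignment term would give something like the following. This is a sketch that assumes a single alignment a is scored and that the score is simply the logarithm of that product; the original formulas may differ in detail.

```latex
\begin{align*}
P(F \mid E) &\approx P(a_1 = 2, a_2 = 1, a_3 = 3 \mid l = 3, m = 3) \\
&\quad \times P(\text{魚鱗癬} \mid \text{Ichthyosis}) \, P(\text{關懷} \mid \text{Concern}) \, P(\text{協會} \mid \text{Association}) \\
\mathrm{Score}(F \mid E) &= \log P(a \mid l, m) + \sum_{i=1}^{l} \log P(f_{a_i} \mid e_i)
\end{align*}
```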
Estimating Lexical Translation Probability Based on Parallel Corpus
A word alignment module is adopted to automatically extract lexical translation probabilities (Wu and Chang 2003):
1. Develop a list of preferred part-of-speech (POS) patterns of collocations in both languages.
2. Match collocation candidates to the preferred POS patterns and apply N-gram statistics for both languages.
3. Apply the log likelihood ratio statistic to pairs of consecutive words in both languages.
4. Finally, align content words using the Competitive Linking Algorithm (Melamed 1997).
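The log likelihood ratio for two consecutive words is commonly computed from a 2×2 contingency table of bigram counts (Dunning's G² statistic). The sketch below assumes that formulation; the function name and the count arguments are illustrative and not taken from the slides.

```python
import math


def log_likelihood_ratio(c12, c1, c2, n):
    """G^2 statistic for a bigram (w1, w2).

    c12: count of the bigram w1 w2
    c1:  count of bigrams whose first word is w1
    c2:  count of bigrams whose second word is w2
    n:   total number of bigrams
    """
    # Observed 2x2 table: rows = first word is / is not w1,
    # columns = second word is / is not w2.
    observed = [
        [c12, c1 - c12],
        [c2 - c12, n - c1 - c2 + c12],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # the term 0 * log(0) is taken as 0
                expected = row_totals[i] * col_totals[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2
```

A value near zero means the two words co-occur about as often as chance predicts; larger values suggest a collocation candidate.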
To avoid introducing too much noise, only bilingual phrases with high probabilities are considered.
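Competitive linking is a greedy, one-to-one linking procedure: word pairs are visited from the highest association score to the lowest, and a pair is linked only if neither word is already linked. The sketch below is a simplified version for a single phrase pair; the scores and the `threshold` parameter are hypothetical.

```python
def competitive_linking(scores, threshold=0.0):
    """Greedy one-to-one word linking.

    scores: dict mapping (source_word, target_word) to an association score.
    Pairs are taken from highest to lowest score; a pair is linked only if
    neither word has been linked already and its score exceeds `threshold`.
    """
    linked_src, linked_tgt = set(), set()
    links = []
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (src, tgt), score in ordered:
        if score <= threshold:
            break  # all remaining pairs score even lower
        if src in linked_src or tgt in linked_tgt:
            continue
        links.append((src, tgt))
        linked_src.add(src)
        linked_tgt.add(tgt)
    return links


# Hypothetical association scores for one aligned phrase pair.
scores = {
    ("Ichthyosis", "魚鱗癬"): 9.1,
    ("Ichthyosis", "協會"): 1.2,
    ("Concern", "關懷"): 6.4,
    ("Concern", "魚鱗癬"): 2.0,
    ("Association", "協會"): 7.7,
    ("Association", "關懷"): 0.8,
}
links = competitive_linking(scores)
```

Raising `threshold` is one simple way to keep only high-confidence links, in the spirit of the noise filtering mentioned above.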
Estimating Lexical Translation Probability Based on Transliteration Model
A Romanization system is adopted to represent a Chinese word.
E and F are assumed to be an English word and a Romanized Chinese character sequence, respectively.
The transliteration probability P(F | E) can be approximated by decomposing E and F into transliteration units (TUs).
A word E with l characters and a Romanized word F with m characters are denoted by e_1 e_2 … e_l and f_1 f_2 … f_m, respectively.
The mapping of (E, F) can be represented as a sequence of n matched TUs: {(u_1, v_1), (u_2, v_2), …, (u_n, v_n)}.
The alignment a between E and F can be represented as a sequence of match types (m_1 m_2 … m_n), where m_i denotes the pair of lengths of u_i and v_i.
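To make the TU and match-type notation concrete, the sketch below derives the match-type sequence from a given segmentation. The segmentation of "smith" against the romanization "shimisi" is a hypothetical example chosen for illustration, and both function names are assumptions.

```python
def match_types(tu_pairs):
    """Return the match-type sequence (m_1, ..., m_n) for aligned TUs.

    tu_pairs: list of (u_i, v_i) string pairs, where u_i is a substring of
    the English word and v_i a substring of the Romanized Chinese word.
    Each match type is the pair (len(u_i), len(v_i)).
    """
    return [(len(u), len(v)) for u, v in tu_pairs]


def is_valid_segmentation(english, romanized, tu_pairs):
    """Check that the TUs concatenate back to the two original words."""
    return ("".join(u for u, _ in tu_pairs) == english
            and "".join(v for _, v in tu_pairs) == romanized)


# Hypothetical segmentation, for illustration only.
tu_pairs = [("s", "shi"), ("mi", "mi"), ("th", "si")]
types = match_types(tu_pairs)
valid = is_valid_segmentation("smith", "shimisi", tu_pairs)
```

Here the first TU pair has match type (1, 3): one English character mapped to three Romanized characters.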
Estimating Lexical Translation Probability Based on Transliteration Model
NE alignment
1.
g(0,0) = 0
2.
3.
Suppose that there is an entry (e_i, w_f) in the bilingual dictionary.
Score_lex(f_{a_i} | e_i) is formulated as:
Approximate Matching
CPNR
Chinese surnames are used as anchor points.
The Chinese personal name recognizer is applied only when the given NE is a person name and Score_tm(R(f_{a_i}) | e_i) is less than the threshold Thr1.
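A surname-anchored recognizer of this kind might look like the sketch below. The surname set, the assumption of single-character surnames followed by a given name of one or two characters, and the function names are all hypothetical; only the PER-type and threshold condition comes from the slide.

```python
def person_name_candidates(sentence, surnames, max_given_len=2):
    """Propose Chinese personal-name candidates anchored on surnames.

    sentence: Chinese sentence as a string of characters (no spaces).
    surnames: set of single-character surnames (hypothetical list).
    max_given_len: assumed maximum length of the given name.
    Returns (start_index, candidate) pairs.
    """
    candidates = []
    for i, ch in enumerate(sentence):
        if ch not in surnames:
            continue
        for given_len in range(1, max_given_len + 1):
            end = i + 1 + given_len
            if end <= len(sentence):
                candidates.append((i, sentence[i:end]))
    return candidates


def should_apply_cpnr(ne_type, score_tm, thr1):
    """Apply the recognizer only to PER entities scoring below Thr1."""
    return ne_type == "PER" and score_tm < thr1


surnames = {"王", "李", "張"}
cands = person_name_candidates("作者王小明說", surnames)
```

Each candidate would still need to be scored and ranked against the other candidates before one is chosen.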
Training Data
Noun phrases of the BDC Electronic Chinese-English Dictionary were used to train PTM.
To train the transliteration model, 2,430 pairs of English names together with their Chinese transliterations and Chinese Romanization were used.
The LDC Central News Agency Corpus was used to extract keywords of entity names for identifying NE types. We collected 117 bilingual keyword pairs from the corpora.
A list of Chinese surnames was also gathered to help identify and extract PER-type NEs.
The parallel corpus collected from the Sinorama Magazine was used to construct the corpus-based lexicon and to estimate the lexical translation probabilities (LTP).
Experiments
275 aligned sentences were randomly selected from Sinorama.
Answer keys were manually prepared.
Each chosen aligned sentence contains at least one NE pair.
Currently, the length of English NEs is restricted to less than 6.
In total, 830 pairs of NEs are labeled. The numbers of NE pairs for types PER, LOC, and ORG are 208, 362, and 260, respectively.