IE and KD for Foreign Languages and Low
IE for Low-resource Languages
Heng Ji
Outline
• Name Translation Mining
• Bi-lingual Dictionary Induction
• Cross-lingual Projection
Why Name Translation
Our Goal:
Break the Language Barrier
Online Language Populations (Total: 801.4 Million, Sept 2004)
Standard MT is Simply not Enough
• Source Text
俄塔社援引紧急情况部莫斯科市总局新闻处处长博贝列夫 (Bo Bei Lie Fu)的话...
• Reference Translation
The Russian news agency Tass, quoting Director Bobylev of the news office of the Moscow city headquarters of the Emergency Situation Department...
• Various MT System Translations
o Russia 's Tass news agency quoted the ministry for emergency situations of the Moscow city , Director of Information Services , German Gref...
o Itar-Tass quoted the Emergency Situations Ministry in Moscow City Administration Director Bo , yakovlev...
o Russia 's Tass news agency of the Ministry of Emergency Situations Moscow city administration of Addis Ababa , Director of Information Services...
o Russian news agency quoted the ministry of emergency situations in Moscow city administration of the Director of Information Services , A. Kozyrev...
o Itar-Tass quoted the Emergencies Ministry in Moscow , the Director of information in 1988 lev...
Name Translation Maze
[Figure: name types on the English side and the Chinese side (Phonetic Name, Semantic Name, Semantic+Phonetic Name), linked by example pairs]
• 解放之虎 ↔ Liberation Tiger
• 花旗银行 “Colorful-flag Bank” ↔ Citibank
• 长江 “Long River” ↔ Yangtze River
• 尤申科 “You shen ke” ↔ Yushchenko
• 可伶可俐 “Ke Ling Ke Li” ↔ Clean Clear
• 欧佩尔吧 ↔ Opal Bar
• 清华大学学报 “The Journal of Tsinghua University” ↔ Tsinghua Da Xue Xue Bao
• 华尔街 “Hua Er Street” ↔ Wall Street
• 尤干斯克石油天然气公司 ↔ Yuganskneftegaz Oil and Gas Company
Need advanced transliteration model
But not only these…
Name Translation Maze
[Figure: the same English/Chinese name types (Phonetic Name, Semantic Name, Semantic+Phonetic Name), extended with Context-Dependent Name and No-Clue Name]
Use Global Context
• 红军: Red Army (in China) or Liverpool Football Club (England)
• 亚西尔·阿拉法特: Yasser Arafat (PLO Chairman) or Yasir Arafat (Cricketer)
• 圣地亚哥市: Santiago City (in Chile) or San Diego City (in CA)
• 潘基文: Pan Jiwen (Chinese) or Ban Ki-Moon (Korean Foreign Minister)
• 林一: Lin Yi (Chinese) or Hayashi Hajime (Japanese Writer)
Motivation
• Traditional methods use supervised transliteration and LM re-scoring
• To discover name pairs from comparable corpora
o About similar topics, but not in general translations of each other
o Naturally available; e.g., many news agencies release multi-lingual news articles on the same day
• Limitations of previous approaches
o Require a supervised name transliteration module as a baseline, and exploit the distributional evidence from comparable corpora only for re-scoring
o Limited to names that are phonetically transliterated, while many organization names are often rendered semantically
o Cannot disambiguate names according to context
• Toward a transliteration-free approach: Constructing Information Networks
o There are no document-level or sentence-level alignments, but names, relations and events in one language tend to co-occur with their counterparts in the other
o Information extraction (IE) techniques are currently available for some non-English languages
Bilingual Information Networks (Ji, 2009)
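[Figure: a Chinese information network and an English information network with matching structure]
Chinese nodes: 库瓦斯, 蒙特西诺斯, 国家情报局, 卡西俄, 利马, 藤森, 秘鲁
English nodes: Arequipa, Montesinos, National Intelligence Service, Callao, Peru, Jorge Chavez Intl. Airport
Edge labels: Sibling, Leader, Located, Birth-Place, Capital, Arrest/2001-06-25
Numbered node pairs:
1. 国家情报局 ↔ National Intelligence Service
2. 蒙特西诺斯 ↔ Montesinos
3. 卡西俄 ↔ Callao
4. 秘鲁 ↔ Peru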
(Lin et al., 2011)
• 3000 languages are endangered; cross-lingual access to a wide range of languages is important
• Goal: Mine name translation pairs from Wikipedia Infoboxes
Contextual Cue (Klementiev et al., 2011)
Temporal Cue
Orthographic and Phonetic Cue
• Transliteration-based match
• Getting the phonetic representations of the English and Chinese candidates
• For example, “father” would be transformed to “faDR”, and “港” would be transformed to “gang3”.
• Splitting the phonetic representations into basic phoneme units.
o Note: There are some questions about the original paper.
• Building a phoneme pronunciation similarity (PPS) table
• Treating the problem as a weighted longest common subsequence problem
• Finding the optimal longest common subsequence
• Normalizing the score of the optimal solution by dividing it by the maximum length of the two sequences
• Using the normalized score as the phonetic similarity score of the two representations
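The scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the original system: the PPS table entries, the function names, and the assumption that inputs are already split into phoneme units are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical phoneme pronunciation similarity (PPS) table.
# The entries and values are illustrative only, not taken from the slides.
PPS: Dict[Tuple[str, str], float] = {
    ("g", "k"): 0.8,
    ("d", "t"): 0.8,
}


def pps(a: str, b: str) -> float:
    """Similarity of two phoneme units: 1.0 if identical, else table lookup."""
    if a == b:
        return 1.0
    return PPS.get((a, b), PPS.get((b, a), 0.0))


def weighted_lcs(x: List[str], y: List[str]) -> float:
    """Best total similarity over all common subsequences (dynamic programming)."""
    m, n = len(x), len(y)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],                               # skip x[i-1]
                dp[i][j - 1],                               # skip y[j-1]
                dp[i - 1][j - 1] + pps(x[i - 1], y[j - 1]),  # pair the two units
            )
    return dp[m][n]


def phonetic_similarity(x: List[str], y: List[str]) -> float:
    """Weighted LCS score divided by the length of the longer sequence."""
    if not x or not y:
        return 0.0
    return weighted_lcs(x, y) / max(len(x), len(y))
```

For instance, `phonetic_similarity(["g", "a", "ng"], ["k", "a", "ng"])` pairs all three units and divides the total similarity by 3. How the phonetic strings are split into units is left open here, as it is on the slide.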
Advanced Person Name Transliteration
Averaged Perceptron Name Transliteration Model
Selects transcription from English name lists based on edit distance
Generates transcriptions if name not on the list
Char-based MT Name Transliteration Model
No reordering model due to monotonicity of the task
Tune model scaling factors for maximum transliteration accuracy
Feed in both tokens and pinyins, generate NBest transliterations
Combination of the two achieved 3.6% and 6% higher accuracy than each alone (Freitag and Khadivi, 2007)
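The list-lookup step (select a transcription from an English name list by edit distance, otherwise fall back to generation) might look roughly like this. It is only a sketch under assumptions: the function names, the `max_dist` threshold, and the example name list are hypothetical, and plain Levenshtein distance stands in for whatever edit distance the original model used.

```python
from typing import List, Optional


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def select_from_list(hypothesis: str, name_list: List[str],
                     max_dist: int = 2) -> Optional[str]:
    """Return the closest listed name, or None if nothing is close enough.

    None signals that the caller should fall back to generating a
    transcription (that generation step is not shown here).
    """
    if not name_list:
        return None
    dist, best = min(
        (edit_distance(hypothesis.lower(), name.lower()), name)
        for name in name_list
    )
    return best if dist <= max_dist else None
```

With a hypothetical list `["reznik", "rezek", "linnik"]`, the raw hypothesis `"reznic"` would be mapped to `"reznik"`, while a hypothesis far from every listed name returns `None`.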
Global Name Selection with English Resources
Find the correct name translation by comparing contexts with:
Large English corpus (Kalmar and Blume, 2007)
Multi-token names
Name frequency in the English corpus
Document context
• Person titles (within-document co-reference resolution)
• Co-occurring entities
• Document date
• Document topic
For Asian target names (whether Mandarin, Cantonese, Korean, Japanese, etc.), do not use the edit distance models; select the best candidate based on context
Large English name list and Gigaword
Re-score the N-best transliteration hypotheses
Build a large character-level LM
16.7% relative error reduction over name transliteration alone
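As a rough illustration of re-scoring N-best transliteration hypotheses with a character-level LM, the sketch below trains an add-one-smoothed character bigram model on a name list and adds its score to each hypothesis score. The class and function names, the bigram order, the smoothing, and the `lm_weight` parameter are all assumptions made for illustration; the slides do not specify these details.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


class CharBigramLM:
    """Character bigram LM with add-one smoothing (illustrative only)."""

    def __init__(self, names: Iterable[str]) -> None:
        self.bigrams: Counter = Counter()
        self.context: Counter = Counter()
        self.vocab = set()
        for name in names:
            chars = ["<s>"] + list(name.lower()) + ["</s>"]
            self.vocab.update(chars)
            for a, b in zip(chars, chars[1:]):
                self.bigrams[(a, b)] += 1
                self.context[a] += 1

    def logprob(self, name: str) -> float:
        """Length-normalized log-probability of a name."""
        chars = ["<s>"] + list(name.lower()) + ["</s>"]
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        total = 0.0
        for a, b in zip(chars, chars[1:]):
            total += math.log((self.bigrams[(a, b)] + 1) / (self.context[a] + v))
        return total / (len(chars) - 1)


def rescore(nbest: List[Tuple[str, float]], lm: CharBigramLM,
            lm_weight: float = 1.0) -> List[Tuple[str, float]]:
    """Add the weighted LM score to each hypothesis score and sort best-first."""
    scored = [(name, score + lm_weight * lm.logprob(name)) for name, score in nbest]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A model trained on names such as `["reznik", "resnick", "reznikov"]` would rank `"reznik"` above an implausible string like `"rxzqik"` when their transliteration scores are equal. In practice the LM would be trained on a much larger name list or corpus.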
Example of Using Document Context
[Figure: a Chinese news document annotated with name translations and transliteration candidates]
Source document:
… 据国际文传电讯社和伊塔塔斯社报道,格里戈里·帕斯科的律师詹利·雷兹尼克向俄最高法院提出上诉。报道说,他请求法庭宣布有罪判决无效,并取消对帕斯科的刑事立案。帕斯科于2001年12月被判处四年有期徒刑,罪名是非法参加一个高级军事指挥官会议,并在会上做笔记。一个军事法庭说他意图将笔记提供给他曾供职的日本媒体。帕斯科的判决包括已服刑的时间。在服满三分之二刑期后,他于今年一月因表现良好被释放。他坚持称自己是无辜的,并表示军方因其披露俄罗斯海军的环境破坏而惩罚他,这包括向海里倾倒放射性废弃物。据国际文传电讯社报道,雷兹尼克表示他在帕斯科获释当日提交的最初一份上诉状从未到达过最高法院主席团手中。这名律师说法院的军事委员会拒绝对上诉进行审理。国际文传电讯社报道,雷兹尼克表示他在新诉状的抬头上直接写着最高法院院长维亚切斯拉夫·列别捷夫,并要求此案不由军事法官考虑,“因为军事司法制度对帕斯科采取了偏见态度”
Annotations on the document:
• 格里戈里·帕斯科: Grigory Pasko
• 律师: Lawyer
• 维亚切斯拉夫·列别捷夫: Vyacheslav Lebedev
Transliteration candidates for 詹利 (zhan li), with scores:
24.11 amri, 23.09 obry, 22.57 zeri, 20.82 henri, 20.00 henry, 19.82 genri, 19.67 djari, 19.57 jafri
Transliteration candidates for 雷兹尼克 (lei zi ni ke), with scores:
28.31 reznik, 26.40 rezek, 25.24 linic, 23.95 riziq, 23.25 ryshich, 22.66 lysenko, 22.58 ryzhenko, 22.19 linnik
English context: Genri Reznik, Goldovsky's lawyer, asked Russian Supreme Court Chairman Vyacheslav Lebedev….
Name labels shown in the figure: Genri Reznik, Henry Reznik
>90% accurate!
Mining from Code-switch Webpages (Lin et al., 2008)
Mining Key Phrase Translations
from Web Corpora
Bilingual Information on the Web
• Searching the parallel data on the web (Resnik 2003)
• Searching the comparable corpus on the web (Fung 1998)
• Anchor texts pointing to the same page (Lu 2004)
Bilingual Information on the Web
• Limited bilingual resources are available as parallel/comparable data on the web
o STRAND: 3,500 English-Chinese document pairs and fewer than 2,500 for English-French (Resnik 2003)
o Comparable corpora: from 10 years of Xinhua Chinese and English stories (2GB), only 110K sentence pairs (44MB) are found as “parallel” (Zhao & Vogel 2002)
o Anchor text mining: from 2M web pages, 2.8MB of Chinese text and 3.1MB of English text are found as potential translations
• More bilingual information is on the web in the form of mixed-language web pages
o Parallel text is not needed in most cases
o Chinese authors usually include the original English for the key phrases
• For consistency
• To give the readers more information
• If they are not sure about the translation in Chinese
Web pages of mixed languages
Mining translations from mixed-lang. pages
• Crawling the Chinese web pages that contain English text (Zhang and Vines, SIGIR 2004)
o Use Google to locate the web pages containing the Chinese terms
o English expressions occurring next to the Chinese terms are considered their translations
o Crawled 2GB of web data; 1,168 distinct English terms were found, of which 61% are correct translations
• Searching for the Chinese terms among English pages (Cheng et al., SIGIR 2004)
o Use Google to retrieve “English” pages containing the Chinese terms
o Extract translations from the snippets
o LiveTrans system
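The heuristic that English expressions next to a Chinese term are candidate translations can be sketched as below. This is a simplified illustration, not the method of either cited system: the function name, the regular expression, and the restriction to English text immediately following the term (optionally in parentheses) are assumptions.

```python
import re
from typing import List

# An English expression: starts and ends with a letter, may contain
# spaces, periods, apostrophes and hyphens in between (illustrative pattern).
ENGLISH = r"[A-Za-z][A-Za-z .'\-]*[A-Za-z]"


def english_next_to(term: str, text: str) -> List[str]:
    """Collect English expressions that directly follow a known Chinese term.

    The term may be followed by an optional half-width or full-width
    opening parenthesis before the English expression.
    """
    pattern = re.compile(re.escape(term) + r"\s*[（(]?\s*(" + ENGLISH + r")")
    return [match.group(1) for match in pattern.finditer(text)]
```

For example, on the text `"他在华尔街(Wall Street)工作。花旗银行（Citibank）总部在纽约。"`, the term `华尔街` yields `Wall Street` and the term `花旗银行` yields `Citibank`. A fuller pipeline would count candidates across many pages and keep the frequent ones, since adjacency alone is noisy.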
Pros and cons of these approaches
Approach                                      Crawling?   Web Resources Available   Difficulty in Extraction
Searching for parallel data                   Yes         Limited                   Hard
Searching for comparable data                 Yes         Moderate                  Harder
Mining Anchor Text                            “Yes”       Limited                   Easy
Extracting translation from mixed-lang page   Yes         Abundant                  Moderate
Search in English pages                       No          Small                     Moderate
Cross-lingual Projection
1. Training Data Projection
2. Test Data Projection
3. Model (e.g., pivot features) projection
Training Data Projection
1. Find a large, parallel bilingual corpus
o E/G part of EUROPARL (25m words)
2. Assign semantic roles on English side
o Train automatic tagger on English data
3. Project semantics over to a low-resource incident language
o Step 1: Find semantic equivalences via word alignment
o Step 2: Project frame
o Step 3: Project roles
Result: Large IL annotated corpus
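The projection in step 3 can be sketched as copying token-level labels across word-alignment links. This is a minimal illustration under assumptions: the function name, the `"O"` label for unannotated tokens, the role names, and the alignment in the usage note are hypothetical, and real systems add filtering for noisy alignments.

```python
from typing import List, Tuple


def project_labels(src_labels: List[str],
                   alignment: List[Tuple[int, int]],
                   tgt_len: int) -> List[str]:
    """Copy frame and role labels from source tokens to aligned target tokens.

    alignment holds (source_index, target_index) pairs from a word aligner.
    Unaligned target tokens keep the label "O". If several source tokens
    with different labels align to one target token, the last link wins.
    """
    tgt_labels = ["O"] * tgt_len
    for src_idx, tgt_idx in alignment:
        label = src_labels[src_idx]
        if label != "O":
            tgt_labels[tgt_idx] = label
    return tgt_labels
```

As a toy case, take "Peter comes home" with the illustrative labels `["Theme", "FRAME:Arriving", "Goal"]` and the hypothetical alignment `[(0, 0), (1, 1), (2, 2), (2, 3)]` to "Peter kommt nach Hause". The projection gives `["Theme", "FRAME:Arriving", "Goal", "Goal"]` on the German side.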
Projection: Example
[Figure: the frame "Arriving" annotated on "Peter comes home" and projected onto "Peter kommt nach Hause"]
Three assumptions to make this work
Assumption 1
Semantic representation is parallel
Assumption 2
There is always parallel lexical material that is semantically equivalent
Assumption 3
Word Alignment provides semantic equivalence
Word Alignment as Semantic Equivalence
• Current Word Alignment models use co-occurrence to determine alignment
o But co-occurrence != semantic equivalence
[Figure: alignment examples: "decide" with "entscheiden" and "Entscheidung treffen"; "insist" with "bestehen darauf"]
Problems: Phrasal verbs, Idioms, Support Verbs
(Funktionsverbgefuege), Noise proper