Machine Translation Projects in WING
Machine Translation Projects at WING
Min-Yen Kan
Web IR / NLP Group (WING)
School of Computing
National University of Singapore
MSRA Regional NLP Workshop (Tokyo, Japan)
15-16 Dec 2008
CSIDM
• A research institute in Singapore wholly owned by the
Institute of Automation, Chinese Academy of Sciences
(CASIA)
– CASIA holds substantial IP on Chinese ASR and NLP
• Set up with funding from the Media Development
Authority of Singapore
• Will involve a staff of over 40, split between
Singapore and Beijing, with help from NUS
Goal: The world’s language learning hub
Mediate communication between Chinese and English speakers
CSIDM Projects
• Personalized Language Learning
• Core NLP
• OOV Mining and Data Acquisition for MT
• Immersive 3D environments
WING’s focus
– Accurate multi-modal sensing and capture for virtual
avatars
– Real time rendering models
• Dialog, mixed-initiative, multi-modal engine
– Gesture recognition also proposed
Project objectives
PI: Kan Min-Yen
CSIDM Partner: Prof. Zhao Jun
• Chinese language learning via translation assistance
• Explore co-training human computation framework
Translation assistance: Objectives
• Help English speakers learn Chinese
– Supply translations and explanations of words while users
browse Web pages as part of their normal tasks
• How?
– Browser based
– Synergistic Machine Translation and Human Computation
State of the art: Similar tools
• Kingsoft
– Memory-based translation tool
• Lingoes
– Integrated lexical service platform based on lexica
– Also supplies definition and example sentences
– Only supplies a framework - actual resources supplied by
third parties
– Focuses on easy input (i.e. getting words or phrases from
the screen) and on displaying the translations
Our Solution - Human Computation
Technical approaches
HOANG Cong Duy Vu
[Figure: co-training framework between Machine Translation and Human Computation; labels: "Improving the MT System", "Testing the Human Learner"]
Expected Impact
• A browser-based second-language learning extension that
simultaneously helps learners and MT systems improve
• Explore both human computation and standard web
mining as methods to obtain data for MT
• Will address speakers of one language helping their
counterparts in the other language
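
The slides do not say how learner input would be turned into MT data. As one possible illustration only, the sketch below aggregates translations proposed by several learners for the same word and accepts one when enough of them agree. The function name `aggregate_votes` and both thresholds are hypothetical, not part of the project described here.

```python
from collections import Counter

def aggregate_votes(votes, min_votes=3, min_agreement=0.6):
    """Return the majority translation if enough learners agree, else None.

    `votes` is a list of candidate translations collected from learners.
    Both thresholds are illustrative assumptions, not values from the talk.
    """
    if len(votes) < min_votes:
        return None  # too few answers to trust any of them
    best, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return best
    return None  # learners disagree too much

# Three of four learners propose the same translation for 表单.
print(aggregate_votes(["form", "form", "table", "form"]))
```

Accepted pairs could then be added to the MT system's training data, while words that return `None` would stay in the pool shown to learners.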
WING current MT work
• Function Word Syntax-Based (FWS) MT
– Student: Hendra SETIAWAN (now postdoc at U
Maryland)
– Use “function” words as the basis for movement
• Unsupervised Morphological Machine
Translation
– Student: Thang Minh LUONG (undergraduate)
– Translating from weakly inflected languages to highly
inflected ones in a generic framework
– Use an unsupervised morphology package as the source
of morphological information
表单 是 网页 上 的 数据 输入 域 的 集合
a_form is a_page on data entry fields of a_collection
The idea is to use function words
表单 是 网页 上 的 数据 输入 域 的 集合
a_form is on a_page data entry fields of a_collection
The idea is to use function words
表单 是 网页 上 的 数据 输入 域 的 集合
a_form is data entry fields on a_page of a_collection
The idea is to use function words
表单 是 网页 上 的 数据 输入 域 的 集合
a_form is a_collection of data entry fields of a_page
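
The build-up across these slides can be replayed as a sequence of swaps around function words. The sketch below is only an illustration. The helper `swap_around` and the span lengths are hand-picked assumptions, whereas an actual FWS system would presumably learn such reordering decisions from data. The first 的 has no gloss on the slides, so an empty placeholder stands in for it.

```python
def swap_around(tokens, head, left_len, right_len):
    """Exchange the spans left and right of tokens[head], keeping the head between them."""
    left = tokens[head - left_len:head]
    right = tokens[head + 1:head + 1 + right_len]
    return (tokens[:head - left_len] + right + [tokens[head]] + left
            + tokens[head + 1 + right_len:])

def render(tokens):
    """Join the gloss tokens, skipping the empty placeholder."""
    return " ".join(t for t in tokens if t)

# Gloss in Chinese word order: 表单 是 网页 上 的 数据 输入 域 的 集合
gloss = ["a_form", "is", "a_page", "on", "", "data", "entry", "fields",
         "of", "a_collection"]

# Step 1: move 上 ("on") in front of its argument 网页 ("a_page").
step1 = swap_around(gloss, head=3, left_len=1, right_len=0)
# Step 2: swap "on a_page" and "data entry fields" around the first 的.
step2 = swap_around(step1, head=4, left_len=2, right_len=3)
# Step 3: swap everything modified by the second 的 ("of") with "a_collection".
step3 = swap_around(step2, head=8, left_len=6, right_len=1)

for step in (gloss, step1, step2, step3):
    print(render(step))
```

Each intermediate line corresponds to one slide of the build-up. The final line keeps "on a_page" from the earlier steps.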
The idea is to use function words
Most frequent words ≈ function words
• Easy to identify
• More amenable to statistical modeling
• More compact
表单 是 网页 上 的 数据 输入 域 的 集合
[Figure: frequency axis, from most frequent to least frequent]
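
The approximation "most frequent words ≈ function words" is easy to try out. A minimal sketch, assuming pre-segmented text and a hypothetical helper `top_frequent_words`, is shown below. The cutoff `n` is a free parameter that the slides do not specify.

```python
from collections import Counter

def top_frequent_words(sentences, n):
    """Return the n most frequent tokens in whitespace-segmented sentences."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return [word for word, _ in counts.most_common(n)]

# The example sentence from the slides; a real run would use a large corpus.
sentence = "表单 是 网页 上 的 数据 输入 域 的 集合"
print(top_frequent_words([sentence], 1))
```

Even in this single sentence the top-ranked token is 的, one of the function words the reordering hinges on.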
Morphological MT
Standard MT: Words → Words
Morphological MT:
Words → Unsupervised Monolingual Segmentation → Morphemes → Morphological MT → Morphemes → Inflection Reconstruction (ME, CRF) → Words
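
To make the data flow concrete, the sketch below shows words split into morphemes and joined back into words. It is a toy stand-in only: the suffix list replaces the unsupervised segmenter, the rule-based join replaces the ME/CRF inflection reconstruction, and the translation step between the two is omitted. All names here are hypothetical.

```python
SUFFIXES = ("ing", "ed", "s")  # toy suffix list, an assumption for illustration

def segment(word):
    """Split off one known suffix, marking it with '+' as a continuation."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return [word[:-len(suffix)], "+" + suffix]
    return [word]

def to_morphemes(sentence):
    """Turn a whitespace-separated sentence into a flat morpheme sequence."""
    return [m for word in sentence.split() for m in segment(word)]

def reconstruct(morphemes):
    """Glue '+'-marked morphemes back onto the preceding stem."""
    words = []
    for m in morphemes:
        if m.startswith("+") and words:
            words[-1] += m[1:]
        else:
            words.append(m)
    return " ".join(words)

morphemes = to_morphemes("students translated forms")
print(morphemes)
print(reconstruct(morphemes))
```

In the pipeline on the slide, the MT system would operate on the morpheme sequence, and reconstruction would run on its output rather than on the input as it does here.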
Conclusion
• WING is just one part of a larger NUS / Singapore-wide
push to examine language learning
• CSIDM research center established, linking S’pore
with CASIA researchers
• At WING:
– Prospective: MT data acquisition a la human computation
– Current: Function word and morphological MT
Plugs 1/2
• ACL-IJCNLP 2009 Singapore
* Feb 22, 2009: Full paper submissions due
* Apr 12, 2009: Full paper notification of acceptance
* Apr 26, 2009: Short paper submissions due
* May 17, 2009: Camera-ready full papers due
* May 31, 2009: Short paper notification of acceptance
* Jun 7, 2009: Camera-ready short papers due
* Aug 2-7, 2009: ACL-IJCNLP 2009
Plugs 2/2
• Workshop on text and citation
analysis for scholarly digital libraries
(NLPIR4DL)