Using the Web for Translation Disambiguation at NTCIR-5
Ying Zhang
Phil Vines
School of Computer Science and Information
Technology, RMIT University
Introduction
• Dictionary-based query translation is a
widely used approach in cross-language
information retrieval (CLIR), but a dictionary
entry often lists several candidate translations
Example:
恐怖
/terror/consternation/funk/monstrousness
• Existing solutions for disambiguation:
various techniques that use statistics
obtained from the test collection corpus.
Problem
• In a production system, one would not be
able to use a constrained test collection for
disambiguation.
Work of this paper
• To see how well these techniques perform when
the web is used to provide context for
disambiguation.
Query translation
1 Chinese OOV translation
• The basis of the approach is the observation
that, on the web, Chinese terms translated from
English tend to be accompanied by the original
English terms, typically immediately after the
Chinese text
Step 1:
Use Google to fetch the top 300 Chinese documents, using the entire
Chinese query.
Step 2:
Where English text occurs, check the immediately preceding Chinese
text to see if it is a substring of the Chinese query.
Step 3:
Select the English text e with the highest frequency.
Step 4:
For this English text e, select the associated Chinese query substring
c with the highest co-occurrence frequency.
Step 5:
If the selected Chinese query substring c cannot be found in the
Chinese segmentation dictionary, treat it as an out-of-vocabulary (OOV)
term: add c to the Chinese segmentation dictionary and (c, e) to the
translation dictionary.
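Steps 2 to 5 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the documents from Step 1 have already been fetched as plain strings, and the names `PAIR_RE`, `longest_query_suffix`, `find_oov_pair` and `update_dictionaries` are hypothetical. Matching the longest suffix of the preceding Chinese text and lower-casing the English text are also assumptions made for the sketch.

```python
import re
from collections import Counter

# Assumed pattern: a run of Chinese characters, an optional opening bracket,
# then a run of English letters (spaces and hyphens allowed inside).
PAIR_RE = re.compile(r"([\u4e00-\u9fff]+)\s*[((]?\s*([A-Za-z][A-Za-z\- ]*[A-Za-z])")


def longest_query_suffix(chinese_run, query, min_len=2):
    """Step 2: longest suffix of the preceding Chinese text found in the query."""
    for start in range(len(chinese_run) - min_len + 1):
        suffix = chinese_run[start:]
        if suffix in query:
            return suffix
    return None


def find_oov_pair(query, documents):
    """Steps 2-4: return the pair (c, e), or None if nothing matches."""
    pair_counts = Counter()
    for doc in documents:
        for chinese_run, english in PAIR_RE.findall(doc):
            c = longest_query_suffix(chinese_run, query)
            if c is not None:
                pair_counts[(c, english.lower())] += 1
    if not pair_counts:
        return None

    # Step 3: English text e with the highest frequency.
    english_counts = Counter()
    for (_, e), n in pair_counts.items():
        english_counts[e] += n
    best_e = english_counts.most_common(1)[0][0]

    # Step 4: query substring c that co-occurs most often with e.
    best_c = max(
        (c for (c, e) in pair_counts if e == best_e),
        key=lambda c: pair_counts[(c, best_e)],
    )
    return best_c, best_e


def update_dictionaries(pair, segmentation_dict, translation_dict):
    """Step 5: register c as an OOV term if the segmenter does not know it."""
    c, e = pair
    if c not in segmentation_dict:
        segmentation_dict.add(c)
        translation_dict.setdefault(c, []).append(e)
```

Here the segmentation dictionary is modelled as a set of strings and the translation dictionary as a mapping from a Chinese term to a list of English translations; real dictionaries would likely carry more information.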
2 Structured English query
For a Chinese query Q = (c1, c2, …, cn), let
Ei = {ei1, ei2, …, eim} be the set of candidate
translations of ci.
1. The translation sets Ei of all query terms are
combined with the logical operator AND
2. The candidate translations eij (j ∈ [1, m]) of
a query term are enclosed in parentheses
and combined with the logical operator OR
3. Phrases are enclosed in quotation marks and
treated as units
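The three rules above can be illustrated with a short Python sketch. The function name `build_structured_query` is hypothetical, and treating any candidate that contains a space as a phrase is an assumption made for the example.

```python
def build_structured_query(translation_sets):
    """Combine candidate translations with OR inside each set and AND across sets."""
    groups = []
    for candidates in translation_sets:
        # Rule 3: multi-word candidates are quoted so they are matched as phrases.
        quoted = [f'"{e}"' if " " in e else e for e in candidates]
        # Rule 2: candidates of one query term are parenthesized and OR-ed.
        groups.append("(" + " OR ".join(quoted) + ")")
    # Rule 1: the translation sets of all query terms are AND-ed.
    return " AND ".join(groups)


print(build_structured_query([["terror", "consternation"], ["labor policy"]]))
```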
3 Using the web
• Use Google to fetch up to 300 top-ranked
documents using the structured queries
generated in the previous step.
• The retrieved documents are then filtered
to remove HTML tags and metadata,
leaving only the web text as the corpus to
provide context for disambiguation.
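The filtering step might look like the following sketch, which uses Python's standard `html.parser` module. It is only an approximation of the filtering described above: the class name `TextExtractor`, the helper `html_to_text`, and the choice of which elements count as metadata (`head`, `title`, `script`, `style`) are assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping elements assumed to hold metadata or code."""

    SKIP = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Applying `html_to_text` to each retrieved page and concatenating the results would give the plain-text corpus from which co-occurrence statistics are collected.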
4 Co-occurrence statistics Using an HMM Model
• The Chinese query is (c1, c2, …, cn).
• Each candidate English translation E is a sequence
(e1, e2, …, en), where ei is a candidate translation of ci.
• A probability model P(E) = P(e1, e2, …, en) is used to
estimate the maximum likelihood (ML) of each
sequence of words.
• The English translation E with the highest P(E)
among all possible translation sequences is selected.
Symbols used in estimating P(E):
• f(e) is the corpus frequency of term e
• N is the number of terms occurring in the corpus
• f_w(e, e') is the frequency of term e' occurring after term e within a window of size w
• n_k is the number of terms with collection frequency k
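The estimation formulas themselves are not preserved in this transcript, so the sketch below is not the authors' model. It assumes a first-order factorization, P(E) ≈ p(e1) · p(e2 | e1) · … · p(en | en-1), and uses simple add-alpha smoothing as a stand-in for whatever discounting the n_k counts were used for. All function names, the `alpha` parameter and the interpretation of N as the total token count are assumptions.

```python
import math
from collections import Counter
from itertools import product


def collect_statistics(tokens, window=2):
    """Return f(e), N and f_w(e, e') for a tokenized corpus."""
    f = Counter(tokens)
    fw = Counter()
    for i, e in enumerate(tokens):
        # Count every term that follows e within the window.
        for e_next in tokens[i + 1 : i + 1 + window]:
            fw[(e, e_next)] += 1
    return f, len(tokens), fw


def log_probability(sequence, f, n_terms, fw, alpha=0.5):
    """Smoothed log-score of one candidate translation sequence (assumed model)."""
    vocab = len(f) + 1  # one extra slot for unseen terms
    followers = Counter()
    for (e, _), n in fw.items():
        followers[e] += n
    logp = math.log((f[sequence[0]] + alpha) / (n_terms + alpha * vocab))
    for prev, cur in zip(sequence, sequence[1:]):
        logp += math.log((fw[(prev, cur)] + alpha) / (followers[prev] + alpha * vocab))
    return logp


def best_translation(candidate_sets, f, n_terms, fw):
    """Pick the sequence with the highest score among all combinations."""
    return max(
        product(*candidate_sets),
        key=lambda seq: log_probability(seq, f, n_terms, fw),
    )
```

Enumerating every combination with `itertools.product` is only practical for short queries; a dynamic-programming search such as Viterbi would be the usual choice for longer ones.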
Experiment
• Chinese queries
49 Chinese queries from NTCIR-5
• English documents
259,050 news articles from 2000 to 2001, from NTCIR-5
Query form:
<TOPIC>
<NUM>001</NUM>
<SLANG>CH</SLANG>
<TLANG>CH</TLANG>
<TITLE>秋鬥,訴求,勞工,抗議,台灣</TITLE>
<DESC>查詢台灣勞工秋鬥遊行的訴求內容以及政府在1998年所提出的勞工政策。</DESC>
<NARR>
<BACK>台灣勞工每年11月12日會舉行秋鬥大遊行。我想知道1998年勞工們向行政院勞委會提出的訴求以及勞委會當時所承諾勞工的政策重點有哪些。</BACK>
<REL>勞工的訴求視為相關。勞委會回應訴求所提出勞工政策重點也視為相關,遊行抗議的過程則視為不相關。</REL>
</NARR>
<CONC>勞工,抗議,勞委會,訴求,勞工政策</CONC>
</TOPIC>
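A topic in this form can be read with Python's standard `xml.etree.ElementTree` module. The sketch below extracts the two fields used in the runs, TITLE and DESC; the function name `parse_topic` and the returned dictionary layout are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET


def parse_topic(topic_xml):
    """Extract the topic number, the TITLE terms and the DESC sentence."""
    root = ET.fromstring(topic_xml)
    title = root.findtext("TITLE", default="")
    return {
        "num": root.findtext("NUM", default=""),
        # TITLE is a comma-separated term list; accept ASCII and full-width commas.
        "title_terms": [t for t in re.split(r"[,,、]\s*", title) if t],
        "desc": root.findtext("DESC", default="").strip(),
    }
```

The `title_terms` list would feed a T run and the `desc` string, after Chinese word segmentation, a D run.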
Results
T: TITLE field of the topic
D: DESC field of the topic
T-mono, D-mono: monolingual reference runs
-collection: disambiguation using the test collection
-web: disambiguation using English web documents retrieved by a search engine
Conclusion
• Experimental results show that, when using
the web for disambiguation, it is possible to
achieve effectiveness comparable to that obtained
with a test collection