Transcript Experiment
National Yunlin University of Science and Technology
Iterative Translation Disambiguation for Cross-Language Information Retrieval
Advisor: Dr. Hsu
Presenter: Yu-San Hsieh
Author: Christof Monz and Bonnie J. Dorr
SIGIR 2005, pp. 520-527
Outline
Motivation
Objective
Approach
Experiment Result
Introduction
Experiment
Conclusions
Motivation
Many words or phrases in one language can be translated into another language in a number of ways, so translation ambiguity is very common, and it impacts the effectiveness of information retrieval.
─ Penalty (English) => Elfmeter (soccer) or Strafe (punishment)
Objective
Finding a proper distribution of translation
probabilities that can solve the translation
ambiguity problem.
Approach
Find a proper distribution of translation probabilities.
Computing Term Weight (sketched below)
─ Initialization Step
─ Iteration Step
─ Normalization Step
─ All term weights in a vector
─ Iteration Stop
[Figure: example query with English terms europe, trade, and union and their German translation candidates (europa; gewerbe, geschaeft, handel; union, gewerkschaft), with a sample term-weight computation w_T^1(t_{i,1} | s_i).]
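A minimal Python sketch of the term-weighting loop outlined above, under the assumption that the iteration step adds, to each candidate's current weight, support from the other query terms' candidates weighted by an association score. The function name, the assoc lookup table, and the convergence threshold are illustrative, not taken from the paper; the assoc scores correspond to the association-strength measures on the next slide.

```python
def disambiguate(candidates, assoc, max_iter=20, eps=1e-4):
    """Iteratively weight translation candidates.

    candidates: dict mapping each source term to its list of candidate translations
    assoc: dict mapping an (ordered) pair of target terms to their association strength
    """
    # Initialization step: uniform weight over each term's candidates.
    w = {s: {t: 1.0 / len(ts) for t in ts} for s, ts in candidates.items()}

    for _ in range(max_iter):
        new_w = {}
        for s, ts in candidates.items():
            for t in ts:
                # Iteration step: add support from the candidates of the *other*
                # query terms, weighted by their current weight and their
                # association strength with t.
                support = sum(
                    w[s2][t2] * assoc.get((t, t2), 0.0)
                    for s2, ts2 in candidates.items() if s2 != s
                    for t2 in ts2
                )
                new_w.setdefault(s, {})[t] = w[s][t] + support
            # Normalization step: each term's candidate weights sum to 1.
            total = sum(new_w[s].values())
            new_w[s] = {t: v / total for t, v in new_w[s].items()}
        # Iteration stop: halt once the weight vector barely changes.
        delta = max(abs(new_w[s][t] - w[s][t]) for s in w for t in w[s])
        w = new_w
        if delta < eps:
            break
    return w
```

With the query from the figure (europe, trade, union and their German candidates), mutually associated candidates such as union and handel would be expected to reinforce each other over the iterations, while weakly associated candidates lose weight.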
Approach
Measuring association strength (sketched below)
─ Pointwise mutual information
─ Dice coefficient
─ Log Likelihood ratio
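A sketch of the three association measures listed above, computed from document-level counts. Here n_x and n_y are assumed to be the document frequencies of two target-language terms, n_xy the number of documents containing both, and N the collection size; the exact variants and any smoothing used in the paper may differ.

```python
import math

def pmi(n_xy, n_x, n_y, N):
    # Pointwise mutual information: observed vs. expected co-occurrence.
    return math.log((n_xy * N) / (n_x * n_y)) if n_xy > 0 else 0.0

def dice(n_xy, n_x, n_y):
    # Dice coefficient: overlap of the two terms' document sets.
    return 2.0 * n_xy / (n_x + n_y)

def log_likelihood_ratio(n_xy, n_x, n_y, N):
    # Dunning-style log-likelihood ratio over the 2x2 contingency table.
    k11, k12 = n_xy, n_x - n_xy
    k21, k22 = n_y - n_xy, N - n_x - n_y + n_xy
    def h(*ks):  # sum of k * log(k / total), with the 0 * log 0 = 0 convention
        total = sum(ks)
        return sum(k * math.log(k / total) for k in ks if k > 0)
    return 2.0 * (h(k11, k12, k21, k22)
                  - h(k11 + k12, k21 + k22)
                  - h(k11 + k21, k12 + k22))
```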
Experiment Result
[Figures: retrieval effectiveness of the baseline, the improvement over it, and the differences in average precision for individual queries (topics).]
Introduction
Two techniques for cross-language retrieval
─ Translate the collection of documents into the target language and apply monolingual retrieval
─ Translate the query into the target language and apply translated-query retrieval
Three approaches may be used to produce the translations
─ Machine translation system
─ Dictionary
─ Parallel corpus to estimate the probabilities
Introduction
A word in one language can be translated into another language in a number of ways.
─ Penalty (English) => Elfmeter (soccer) or Strafe (punishment)
Introduction
An approach that can solve the word-selection problem is to use co-occurrences between terms (see the counting sketch below).
Problem (a larger number of terms)
─ Data sparseness
   Use very large corpora for counting co-occurrence frequencies
   Use internet search engines
   Smoothing
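A minimal sketch of counting such co-occurrences over a target-language document collection, which is what the association measures in the Approach section consume. Counting at the document level and whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count, for each term and unordered term pair, the number of documents containing it."""
    term_df, pair_df = Counter(), Counter()
    for doc in documents:
        terms = set(doc.split())  # document-level presence, not token counts
        term_df.update(terms)
        pair_df.update(combinations(sorted(terms), 2))
    return term_df, pair_df
```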
Experiment
Test Data
─ CLEF 2003 English-to-German bilingual data
─ Choose 56 topics (title, description, narrative)
Morphological Normalization
─ Source-language words (topics) are normalized to match entries in the bilingual dictionary
─ De-compounding: character 5-grams (see the sketch below)
─ Assign weights to 5-gram substrings
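A small sketch of the character 5-gram view of de-compounding mentioned above: each German word is broken into its overlapping 5-character substrings, and each substring receives a weight. The uniform 1/number-of-5-grams weighting here is an illustrative assumption, not the paper's exact scheme.

```python
def char_5grams(word, n=5):
    """Overlapping character 5-grams of a word (the word itself if it is shorter)."""
    word = word.lower()
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def weighted_5grams(word, n=5):
    # Illustrative weighting: the word's unit weight is shared equally by its
    # 5-grams, so long compounds do not dominate the similarity computation.
    grams = char_5grams(word, n)
    return {g: 1.0 / len(grams) for g in grams}
```

For a compound such as "gewerkschaft", the 5-grams include "gewer" and "chaft", so the word can partially match related vocabulary even when the full compound is not in the dictionary.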
Experiment
Retrieval Model
─ Lnu.ltc weighting scheme
─ Weighted document similarity
Statistical Significance
─ Bootstrap method (see the sketch below)
   Bootstrap sample
   One-tailed significance testing (compare two retrieval methods)
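A sketch of a bootstrap significance test as mentioned above, comparing two retrieval methods on per-topic average precision with a one-tailed test. The number of resamples, the mean-difference statistic, and the shift-to-the-null construction are assumptions about the exact setup, not details from the paper.

```python
import random

def bootstrap_one_tailed_p(ap_system, ap_baseline, samples=10_000, seed=0):
    """Estimate how likely the observed improvement is under the null hypothesis."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ap_system, ap_baseline)]
    observed = sum(diffs) / len(diffs)
    # Shift the per-topic differences to mean zero, enforcing the null hypothesis.
    centered = [d - observed for d in diffs]
    exceed = 0
    for _ in range(samples):
        # Bootstrap sample: draw per-topic differences with replacement.
        resample = [rng.choice(centered) for _ in diffs]
        if sum(resample) / len(resample) >= observed:
            exceed += 1
    return exceed / samples  # small value -> significant improvement over the baseline
```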
Experiment
Some problems were found in the experiments
─ The individual average precision of the Log-Likelihood Ratio run decreases for a number of queries.
─ Unknown words: the original word from the source language is included in the target-language query.
Example: "Women's Conference Beijing"
─ "Women" is normalized, but the form "Woman" is not found in the dictionary, so "Women" is treated like a proper noun: the original form is kept in the target-language query and assigned a weight of 1.
Result
1. "Women" dominates document similarity.
2. Most top-ranked documents contain "Women" as the only matching term.
Conclusions
Our approach improves retrieval effectiveness compared to a baseline using bilingual dictionary lookup.
Experimental results show that the Log-Likelihood Ratio has the strongest positive impact.
My opinion
Advantage:
─ It only requires a bilingual dictionary and a monolingual corpus in the target language.
Disadvantage:
─ Unknown words
Apply