Mining The Web for Bilingual Text



By:
Karan Chawla
Nikhilkumar Jadhav
Vinita Sharma
1) Basics of bilingual text
2) Approaches to obtain bilingual text
3) STRAND
4) Language identification
5) Scaling up candidate generation
6) Assessing the STRAND data
7) Improving STRAND parameters
8) Content-based matching
9) Internet Archive
10) Conclusion

Text in two different languages depicting the same information.

English:
Sachin Ramesh Tendulkar (born 24 April 1973) is an Indian cricketer widely acknowledged as one of the greatest batsmen in One Day Internationals. Tendulkar has been honoured with the Padma Vibhushan award in 2008, and the Rajiv Gandhi Khel Ratna award, India's highest sporting honour.

Hindi:
सचिन रमेश तेंदुलकर (जन्म: 24 अप्रैल 1973, मुम्बई) क्रिकेट के इतिहास में विश्व के सर्वश्रेष्ठ बल्लेबाजों में गिने जाते हैं। सचिन राजीव गांधी खेल रत्न पुरस्कार से सम्मानित एकमात्र क्रिकेट खिलाड़ी हैं। वे सन् २००८ में पद्म विभूषण से भी पुरस्कृत किये जा चुके हैं।




Domain specific
Hard and expensive to build with human effort
Licensing restrictions
Very few language pairs


The content of a web page is written in two
languages.
For example:
◦ Foreign language learning
◦ Online stores list their products in multiple
languages
Machine Learning Based
Alignment Based
Content Based matching Approach


Mining Bilingual sentences directly from the
webpage.
Pattern Learning



The patterns vary across pages, so it is
impossible to mine the translation pairs
using predefined templates.
Some pages contain consistently formatted
texts in two languages, but they are not
translation pairs.
Not all translations in a collective bilingual
page necessarily follow an exactly
consistent format.
Pipeline:
Web pages (input) → Preprocessing → Seed mining → Pattern learning → Pattern-based mining → Translation pairs (output)

The pipeline draws on a transliteration model and a bilingual dictionary as supporting resources.
Each candidate pattern p_i is scored with a linear model:

f_w(x) = <w, x>

where x is the feature vector of pattern p_i and w is the weight vector.

Features:
• Generality: the percentage of bilingual snippet pairs which match p_i
• Average translation score: the average translation score of all bilingual snippet pairs which can match p_i
• Length: the length of p_i
• Irregularity: the standard deviation of the number of noisy snippets
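The scoring function above is a plain dot product; it can be sketched as follows. The feature values and weights here are made-up illustrative numbers, not values from the paper.

```python
# Hypothetical sketch of the pattern-scoring function f_w(x) = <w, x>.
# Feature names follow the slide; all numeric values are invented.

def score_pattern(features, weights):
    """Dot product <w, x> of a pattern's feature vector with the weight vector."""
    return sum(weights[name] * value for name, value in features.items())

# x: feature vector of a pattern p_i (illustrative values)
features = {
    "generality": 0.6,            # fraction of bilingual snippet pairs matching p_i
    "avg_translation_score": 0.8,
    "length": 0.3,                # normalized length of p_i
    "irregularity": 0.1,          # std. dev. of the number of noisy snippets
}
# w: weight vector (illustrative values)
weights = {"generality": 1.0, "avg_translation_score": 2.0,
           "length": 0.5, "irregularity": -1.5}

print(score_pattern(features, weights))  # 0.6 + 1.6 + 0.15 - 0.15 = 2.2
```

Patterns with higher scores are kept for the pattern-based mining step.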
Alignment-based approach:
• Mine parallel web documents from web sites
• Extract bilingual sentences from mixed parallel data using sentence alignment methods

STRAND: Structural Translation Recognition for Acquiring Natural Data




Conceptually simple
Fully language Independent
Scalable
High accuracy without human intervention
STRAND architecture:
WWW → Candidate generation phase → Candidate pair evaluation → Candidate pair filtering (language dependent) → Translation pairs

Candidate pair generation
• <url1, url2>: web pages that may be translations of each other
Candidate pair evaluation
• Filtering
• Language independent
Language-dependent filtering
• Additional filtering criteria


Input: lang1 and lang2

Step 1: Using AltaVista, find web pages containing:
• at least one hyperlink where lang1 appears in the text or URL associated with the link
• at least one such hyperlink for lang2

Query: anchor:"lang1" AND anchor:"lang2"

Step 2: Each page from step 1 is processed to extract pairs <url1, url2> appearing in anchors a1, a2 respectively. The anchors should be no more than 10 lines apart.
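The step-2 extraction can be sketched with the standard-library HTML parser. The pairing heuristic below (language name appearing in anchor text or URL, anchors at most 10 lines apart) follows the slide; the code itself is an assumed illustration, not STRAND's implementation.

```python
# Sketch of candidate-pair extraction: collect anchors with their line
# numbers, then pair anchors mentioning lang1 with anchors mentioning
# lang2 that appear no more than 10 lines apart.
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.anchors = []          # (line_number, href, text)
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href", "")
            self._line = self.getpos()[0]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._line, self._href, "".join(self._text)))
            self._href = None

def candidate_pairs(html, lang1, lang2, max_distance=10):
    collector = AnchorCollector()
    collector.feed(html)

    def mentions(anchor, lang):
        _, href, text = anchor
        return lang.lower() in href.lower() or lang.lower() in text.lower()

    pairs = []
    for a1 in collector.anchors:
        for a2 in collector.anchors:
            if mentions(a1, lang1) and mentions(a2, lang2) \
               and abs(a1[0] - a2[0]) <= max_distance:
                pairs.append((a1[1], a2[1]))   # <url1, url2>
    return pairs

page = """<a href="/en/index.html">English</a>
<a href="/fr/index.html">Français</a>"""
print(candidate_pairs(page, "english", "fran"))
```

On the two-anchor example this yields the single pair ('/en/index.html', '/fr/index.html').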

The query may also fetch documents in lang2 containing an anchor in lang1; for example, pages in French containing pointers in English.

Another option is to exploit parallel URL or directory structure:
• http://foo.boo/en/program.html
• http://foo.boo/fr/program.html
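The parallel-URL trick amounts to substituting language markers in a URL. The substitution list below is an assumed illustration; a real system would use a much larger catalogue of patterns.

```python
# A minimal sketch (assumed heuristic, not the original implementation) of
# exploiting parallel URL structure: substitute language substrings to
# propose a translated-sibling URL.

# Common substitution patterns; illustrative, not exhaustive.
LANG_SUBS = [("/en/", "/fr/"), ("_en.", "_fr."), ("-english", "-french")]

def sibling_urls(url):
    """Generate candidate translated-sibling URLs for a given URL."""
    candidates = []
    for src, dst in LANG_SUBS:
        if src in url:
            candidates.append(url.replace(src, dst))
        if dst in url:
            candidates.append(url.replace(dst, src))
    return candidates

print(sibling_urls("http://foo.boo/en/program.html"))
# ['http://foo.boo/fr/program.html']
```

Each proposed sibling would then be fetched and passed to the structural filter.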

This stage exploits the facts that:
• parallel pages are filled with a great deal of identical HTML markup
• there is a linear relationship between the lengths of text translations
• pieces of identical markup are reliable points of correspondence

The goal is the best alignment of markup and non-markup chunks between the two documents.
1. Linearize: a document is converted into a linear sequence of tokens
2. Align the linearized sequences
3. Threshold the aligned, linear sequences based on mismatches
4. Compute a confidence value

[START:element_label]
◦ e.g. [START:A], [START:LI]
[END:element_label]
◦ e.g. [END:A]
[Chunk:length]
◦ e.g. [Chunk:174]
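The linearization step can be sketched with the standard-library HTML parser producing exactly these three token types. This is an assumed illustration standing in for STRAND's own tokenizer.

```python
# Sketch of linearization: convert an HTML document into a flat sequence
# of [START:...], [END:...], and [Chunk:length] tokens.
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag.upper()}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag.upper()}]")

    def handle_data(self, data):
        text = data.strip()
        if text:                      # ignore whitespace-only chunks
            self.tokens.append(f"[Chunk:{len(text)}]")

def linearize(html):
    lin = Linearizer()
    lin.feed(html)
    return lin.tokens

doc = "<HTML><TITLE>Emergency Exit</TITLE><BODY><H1>Emergency Exit</H1>"
print(linearize(doc))
```

Chunk lengths depend on how whitespace is counted, so they may differ slightly from the numbers on the slides.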

The sdiff utility does a fine job of alignment:
• matching up identical START and END tokens in the sequence
• aligning Chunk tokens of similar lengths in such a way as to minimize differences between the two sequences
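The slides use the Unix sdiff utility; as a stdlib stand-in, difflib.SequenceMatcher can align two token sequences in the same spirit, matching identical START/END tokens and pairing mismatched Chunk tokens positionally. This is an assumption for illustration, not STRAND's alignment code.

```python
# Align two linearized token sequences; None marks an unmatched position.
import difflib

def align(seq1, seq2):
    matcher = difflib.SequenceMatcher(a=seq1, b=seq2, autojunk=False)
    pairs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            pairs.extend(zip(seq1[i1:i2], seq2[j1:j2]))
        elif op == "replace":
            # pair up mismatched tokens positionally (e.g. Chunk tokens)
            for k in range(max(i2 - i1, j2 - j1)):
                left = seq1[i1 + k] if i1 + k < i2 else None
                right = seq2[j1 + k] if j1 + k < j2 else None
                pairs.append((left, right))
        elif op == "delete":
            pairs.extend((tok, None) for tok in seq1[i1:i2])
        elif op == "insert":
            pairs.extend((None, tok) for tok in seq2[j1:j2])
    return pairs

english = ["[START:HTML]", "[START:TITLE]", "[Chunk:12]", "[END:TITLE]"]
french  = ["[START:HTML]", "[START:TITLE]", "[Chunk:15]", "[END:TITLE]"]
for pair in align(english, french):
    print(pair)
```

The aligned ([Chunk:12], [Chunk:15]) pair is exactly the kind of input the length-correlation test below consumes.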
English document:
<HTML>
<TITLE>Emergency Exit</TITLE>
<BODY>
<H1>Emergency Exit</H1>
If seated at an exit and ...

French document:
<HTML>
<TITLE>Sortie de Secours</TITLE>
<BODY>
Si vous êtes assis à côté d'une ...

Linearized English:
[START:HTML] [START:TITLE] [Chunk:12] [END:TITLE] [START:BODY] [START:H1] [Chunk:12] [END:H1] [Chunk:112]

Linearized French:
[START:HTML] [START:TITLE] [Chunk:15] [END:TITLE] [START:BODY] [Chunk:122]

Using the alignments, four scalar parameters are calculated which characterize the quality of the alignment:

• dp: the difference percentage, quantifying the extent to which there is nonshared material (mismatches) in the alignment. dp < 20% was used.
• n: the number of aligned non-markup chunks of unequal length.
• r: the Pearson correlation of the lengths of aligned non-markup chunks, quantifying the extent to which the chunk lengths are correlated.
• p: the significance level of the correlation r, which helps characterize the quality of the alignment. p < 0.05 was used, indicating the correlation is unlikely to arise by chance.
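The four parameters can be sketched from the aligned (length, length) pairs. The exact definitions of dp and n below are plausible readings of the slides, not the reference implementation; r is the standard Pearson correlation, and the p-value uses a normal approximation to stay stdlib-only.

```python
# Compute dp, n, r, p from aligned non-markup chunk length pairs.
import math

def alignment_parameters(pairs, total_tokens):
    """pairs: (len_x, len_y) for aligned non-markup chunk tokens;
    total_tokens: token count over both linearized sequences."""
    unequal = [(x, y) for x, y in pairs if x != y]
    n = len(unequal)                       # aligned chunks of unequal length
    dp = 100.0 * n / total_tokens          # difference percentage (assumed defn)
    xs, ys = zip(*unequal)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in unequal)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    r = cov / (sx * sy)                    # Pearson correlation of lengths
    # significance of r: t statistic with a normal approximation to keep
    # this stdlib-only (needs n >= 3; illustration only)
    t = r * math.sqrt((n - 2) / (1 - r * r)) if r * r < 1 else float("inf")
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2))))
    return dp, n, r, p

dp, n, r, p = alignment_parameters([(12, 15), (50, 56), (112, 122)], total_tokens=20)
print(f"dp={dp:.1f} n={n} r={r:.3f} p={p:.3g}")
```

A real implementation would use the t distribution (e.g. scipy.stats.pearsonr) rather than the normal approximation.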

Two documents may be parallel up to a point, after which one goes on to include more details.

e.g. the Wikipedia page for a language in English and French, where the English version has more content.

If the mismatch proportion exceeds a threshold K, the pair is discarded. In the experiments conducted, K was 20%.


Let <X, Y> = {(x1, y1), ..., (xn, yn)} be the set of pairs of lengths of aligned chunk tokens such that xj ≠ yj.

For the example: <X, Y> = {(12, 15), (112, 122), ...}

Compute the Pearson correlation coefficient and retain pairs with a high correlation coefficient. The output is a set of aligned non-markup texts.


Extract text
Use Language models to retain highly
probable text

The original STRAND did not have language-dependent processes such as automatic language identification or content-based comparison of structurally aligned documents. It is essential to incorporate these processes in order to handle false positives.


Make sure that the two pages are written in the languages they are supposed to be written in. Statistical language identification based on character n-grams was added to the system.

A pair is kept only if:
Pr(L1|d1) > Pr(L2|d1) and Pr(L2|d2) > Pr(L1|d2)

where L1, L2 are the two languages and d1, d2 are the two documents.
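The criterion above can be sketched with a tiny character-bigram language model. The training texts, smoothing constant, and vocabulary size here are toy-sized assumptions; a real identifier would train on far more data.

```python
# Minimal character-bigram language identification sketch.
from collections import Counter
import math

def bigram_model(text):
    counts = Counter(text[i:i+2] for i in range(len(text) - 1))
    return counts, sum(counts.values())

def log_prob(text, model, vocab_size=10000):
    counts, total = model
    score = 0.0
    for i in range(len(text) - 1):
        # add-one smoothing over an assumed vocabulary size
        score += math.log((counts[text[i:i+2]] + 1) / (total + vocab_size))
    return score

# toy training samples (illustrative only)
english = bigram_model("the quick brown fox jumps over the lazy dog " * 20)
french  = bigram_model("le renard brun saute par dessus le chien " * 20)

doc = "the fox and the dog"
# keep the page only if it is most probable under its claimed language
print("English" if log_prob(doc, english) > log_prob(doc, french) else "French")
```

Applied to both documents of a candidate pair, this implements the Pr(L1|d1) > Pr(L2|d1) check (with uniform language priors).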



STRAND with the new language identification model was run for English and Spanish.
Top 1000 hits generated by AltaVista: 913 candidate pairs selected.
Test set: 179 items for annotation by human judges, containing:
• all the pairs marked GOOD by STRAND (passed both filters)
• all the pairs filtered out via language identification
• a random sample of pairs filtered out structurally

STRAND is being tested NOT on its ability to determine translation quality, but on its ability to locate document pairs that can be included in the corpus.
Comparison         N     Pr(Agree)   κ
J1, J2             106   0.85        0.70
J1, STRAND         165   0.91        0.79
J2, STRAND         113   0.81        0.61
J1 ∩ J2, STRAND    90    0.91        0.82

Precision: 92.1   Recall: 47.3
English-Spanish evaluation
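The Pr(Agree) and κ columns are raw agreement and Cohen's kappa, which corrects agreement for chance. A small sketch, using made-up judgments rather than the study's data:

```python
# Raw agreement and Cohen's kappa between two judges' labels.
from collections import Counter

def kappa(labels1, labels2):
    n = len(labels1)
    p_agree = sum(a == b for a, b in zip(labels1, labels2)) / n
    c1, c2 = Counter(labels1), Counter(labels2)
    # expected chance agreement from each judge's marginal label frequencies
    p_chance = sum(c1[l] * c2[l] for l in set(labels1) | set(labels2)) / n**2
    return p_agree, (p_agree - p_chance) / (1 - p_chance)

# illustrative GOOD/BAD judgments (not from the paper)
j1 = ["good", "good", "bad", "good", "bad", "good", "bad", "bad"]
j2 = ["good", "good", "bad", "bad", "bad", "good", "bad", "good"]
p_agree, k = kappa(j1, j2)
print(round(p_agree, 2), round(k, 2))  # 0.75 0.5
```

Note how κ (0.5) is well below raw agreement (0.75) because two balanced judges would agree half the time by chance alone.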


Recall could be increased by making structural filtering more intelligent, for example by ignoring markup when computing alignments. However, since the number of translation pairs M on the Web is large, half of M is also large; so instead the aim is to bring the total yield closer to M.


Need more candidates
Notion of Sibling pages


page in one language contains a link directly
to the translated page in the other language
STRAND was extended to permit both
parent and sibling search criteria.


Problem: we require all the results that meet the search criteria, but AltaVista returns only the top 100 results.

Solution: partition queries by date.
For example: one version of a query would restrict the results to pages last updated on 1 Jan 2013, the next to 2 Jan 2013, and so forth.
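The date-partitioning workaround can be sketched as a query generator. The `date:` syntax below is illustrative; AltaVista's actual date-restriction syntax is not reproduced here.

```python
# Slice one query into many date-restricted queries, one per day, so each
# slice stays under the search engine's top-100 results cap.
from datetime import date, timedelta

def date_sliced_queries(base_query, start, end):
    """Yield one date-restricted query per day in [start, end]."""
    day = start
    while day <= end:
        yield f'{base_query} AND date:{day.isoformat()}'
        day += timedelta(days=1)

queries = list(date_sliced_queries('anchor:"english" AND anchor:"french"',
                                   date(2013, 1, 1), date(2013, 1, 3)))
print(len(queries))   # 3
print(queries[0])
```

Issuing the slices sequentially recovers far more than 100 total results for the underlying query.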


The scaled-up system was run for English-French document pairs in November 1998. The generation component produced 16,763 candidate page pairs, an 18-fold increase over the previous experiment.




3153 page pairs were eliminated because they were either exact duplicates or irretrievable. STRAND's structural filtering removed 9820 candidate page pairs, and the language identification component removed another 414. The remaining 3376 page pairs were considered GOOD.
Comparison         N     Pr(Agree)   κ
J1, J2             267   0.98        0.95
J1, STRAND         273   0.84        0.65
J2, STRAND         315   0.85        0.63
J1 ∩ J2, STRAND    261   0.86        0.68

Precision: 79.5   Recall: 70.3
English-French evaluation
Comparison         N     Pr(Agree)   κ
J1, J2             267   0.98        0.95
J1, STRAND         273   0.88        0.70
J2, STRAND         315   0.88        0.69
J1 ∩ J2, STRAND    261   0.90        0.75

Precision: 100   Recall: 64.1
English-French evaluation with stricter language ID criterion

Resnik, Oard, and Levow (2001) showed that a translation lexicon automatically extracted from the French-English STRAND data could be combined productively with a bilingual French-English dictionary in order to improve retrieval results on a standard cross-language IR test collection.


Conducted a ratings-based assessment of English-Chinese data, asking two native Chinese speakers to assign ratings to a set of English-Chinese items. The dataset consisted of:
◦ 30 human-translated sentence pairs from the FBIS
corpus
◦ 30 Chinese sentences from the FBIS corpus and
translation output from Alta Vista’s Babelfish1
◦ 30 paired items from Chinese-English Web data,
retrieved from STRAND
1. http://babelfish.altavista.com

Asked 3 types of ratings:
◦ Assessing English Fluency
◦ Assessing Chinese Fluency
◦ Adequacy of the translation
Translation adequacy ratings: Distribution over scores for human-translated
(Human), Web-generated (WWW), and machine-translated (MT) data.


STRAND's structure-based filter consistently throws out around one-third of the candidate document pairs it has found in order to maintain its precision in the 98-100% range. The four structural values (dp, n, r, and p) were used as features characterizing each document pair.



A 9-fold cross-validation experiment was done using decision tree induction (C5.0). Each fold had 87 test items and 174 training items. The fraction of good and bad pairs in each fold's test and training sets was roughly equal to the overall division (33% to 67%, respectively).
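The tuning experiment can be sketched with a k-fold cross-validation loop. The paper uses C5.0 decision trees over all four structural features; to stay dependency-free, the sketch below substitutes a one-feature decision stump over dp, and all data values are synthetic.

```python
# 9-fold cross-validation of a decision stump over the dp feature.
import random

def stump_threshold(train):
    """Pick the dp threshold that best separates good from bad pairs."""
    best_t, best_acc = 0.0, -1.0
    for t, _ in train:                     # candidate thresholds = seen dp values
        acc = sum((dp <= t) == good for dp, good in train) / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def cross_validate(data, k=9):
    random.Random(0).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        test = folds[i]
        train = [d for j, f in enumerate(folds) if j != i for d in f]
        t = stump_threshold(train)         # tune on training folds only
        accs.append(sum((dp <= t) == good for dp, good in test) / len(test))
    return sum(accs) / k

# synthetic (dp, is_good) pairs: good pairs tend to have low dp
data = [(random.Random(i).uniform(0, 25), True) for i in range(60)] + \
       [(random.Random(i + 100).uniform(15, 80), False) for i in range(120)]
print(round(cross_validate(data), 2))
```

The key point mirrored from the paper is that the threshold is tuned on the training folds and evaluated on the held-out fold, never on the data it was fit to.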
Parameter   Value
dp          37
n           11

            Precision   Recall
Untuned     1.000       0.686
Fold 1      0.875       1.000
Fold 2      0.857       0.667
Fold 3      1.000       1.000
Fold 4      1.000       0.923
Fold 5      1.000       0.857
Fold 6      1.000       1.000
Fold 7      0.889       0.667
Fold 8      1.000       0.889
Fold 9      1.000       0.545
Average     0.958       0.841



As Ma and Liberman (1999) pointed out, not all translators create translated pages that look like the original page. Structure-based matching is applicable only in corpora that include markup tags, yet multilingual collections are available that contain parallel text without structural tags.


It gives importance to similarity of content, whether or not similarities of structure exist. A generic score of translational similarity was presented that is based on any word-to-word translation lexicon.


tsim: cross-language similarity score
link: a pair (x, y) in which x is a word in language L1 and y is a word in L2
Example: tsim = 4/7
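A tsim-style score can be sketched as follows. The exact formula behind the 4/7 example is not spelled out on the slide; one plausible reading (an assumption here) is to greedily link word pairs found in the translation lexicon and divide the number of words participating in links by the total word count.

```python
# Assumed tsim sketch: fraction of words covered by one-to-one lexicon links.

def tsim(words1, words2, lexicon):
    """lexicon: set of (L1_word, L2_word) translation pairs."""
    remaining = list(words2)
    linked = 0
    for w1 in words1:
        for w2 in remaining:
            if (w1, w2) in lexicon:
                remaining.remove(w2)   # enforce one-to-one links
                linked += 1
                break
    # each link covers two words, one on each side
    return 2 * linked / (len(words1) + len(words2))

# toy lexicon and texts (illustrative only)
lexicon = {("exit", "sortie"), ("emergency", "secours")}
english = ["emergency", "exit"]
french = ["sortie", "de", "secours"]
print(tsim(english, french, lexicon))  # 2 links covering 4 of 5 words -> 0.8
```

Document pairs whose tsim exceeds a tuned threshold are kept, regardless of markup structure.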
Comparison         N     Pr(Agree)   κ
J1, J2             267   0.98        0.95
J1, STRAND         273   0.88        0.70
J2, STRAND         315   0.88        0.69
J1 ∩ J2, STRAND    261   0.90        0.75

Precision: 100   Recall: 64.1
English-French evaluation with stricter language ID criterion
Comparison       N     Pr(Agree)   κ
J1, tsim         249   0.96        0.92
J2, tsim         243   0.95        0.88
J1 ∩ J2, tsim    280   0.97        0.92

Precision: 83.3   Recall: 92.1
English-French evaluation with similarity measure
Parameter   Value
tsim        0.432
dp          22.9

                      Precision   Recall
Untuned STRAND        1.000       0.686
Tuned STRAND (avg)    0.958       0.841
tsim = 0.44           0.833       0.921
Fold 1                0.875       1.000
Fold 2                1.000       1.000
Fold 3                1.000       1.000
Fold 4                1.000       1.000
Fold 5                1.000       1.000
Fold 6                0.889       1.000
Fold 7                1.000       1.000
Fold 8                1.000       1.000
Fold 9                1.000       0.818
Average               0.974       0.980


A non-profit organization attempting to
archive the entire publicly available Web,
preserving the content and providing free
access to researchers, historians, scholars,
and the general public.
It contains approx. 120TB of data.

Generating candidate pairs on the Archive
involves the following steps:
1. Extracting URLs from index files using simple
pattern matching
2. Combining the results from step 1 into a single
huge list
3. Grouping URLs into buckets by handles
4. Generating candidate pairs from buckets
5. After this, structural filtering is applied.
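The bucketing step (step 3) can be sketched by stripping language markers from each URL to form a "handle"; URLs sharing a handle land in the same bucket. The handle heuristic below is an assumption for illustration.

```python
# Group URLs by a language-stripped handle; buckets with more than one
# URL yield candidate pairs.
import re
from collections import defaultdict

# illustrative set of language markers; a real system would use many more
LANG_PATTERN = re.compile(r"(/|\b|_|-)(en|fr|es|de|english|french)(/|\b|_|-|\.)",
                          re.IGNORECASE)

def handle(url):
    """Strip language markers from a URL to form its bucket key."""
    return LANG_PATTERN.sub(r"\1\3", url)

def bucket_urls(urls):
    buckets = defaultdict(list)
    for url in urls:
        buckets[handle(url)].append(url)
    # candidate pairs come from buckets holding more than one URL
    return {k: v for k, v in buckets.items() if len(v) > 1}

urls = ["http://foo.boo/en/program.html",
        "http://foo.boo/fr/program.html",
        "http://foo.boo/about.html"]
print(bucket_urls(urls))
```

Because bucketing is a single pass over the URL list, it scales to the Archive's index files without pairwise comparison of all URLs.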
                             Precision   Recall
Baseline (on the full set)   0.8993      1.0000

Without tuning:
Fold 1                       1.0000      0.0227
Fold 2                       1.0000      0.1364
Fold 3                       0.6667      0.0435
Average                      0.8889      0.0675

With tuning:
Fold 1                       0.9111      0.9318
Fold 2                       0.9302      0.9090
Fold 3                       0.9565      0.9565
Average                      0.9326      0.9324
                              Precision   Recall
Baseline                      0.8993      1.0000
Structure only (tuned, avg)   0.9326      0.9324
Content only (tsim = 0.058)   0.7688      0.9935
Fold 1                        0.9167      1.0000
Fold 2                        0.9767      0.9545
Fold 3                        0.9583      1.0000
Average                       0.9506      0.9848



Resnik (1999) placed acquisition of parallel text from the Web on solid empirical footing. The system was extended with automated language identification and was scaled up by including sibling pages. In the process, it was discovered that even the most lightweight use of language identification removes a lot of false positives.


Rigorous evaluation using human judges suggests that the technique produces an extremely clean corpus, with noise estimated at between 0 and 8% even without human intervention. The content-based matching technique is shown to perform competitively with the structure-based approach of STRAND on the task of identifying English-French document translations.

This work addresses the problems behind the limited impact of parallel corpora on the community:
• too few languages
• too little data
• difficulty with dissemination

The fact that substantial English-Arabic parallel data was collected means it should be possible to find data for less well-represented pairs. This work was completely independent of language, so success in mining depends primarily on whether the data exist.

Two directions:
• Experiments need to be done using languages that are less common on the Web.
• Application of Web-based parallel text in tasks such as lexical acquisition and cross-language IR.



Improving the quality of the tsim score by
handling of the noisy translation lexicon.
Use of Bootstrapping paradigm for the
construction of parallel corpora.
Use this parallel data in the work on MT and
acquisition of bilingual lexicons.
1. Long Jiang, Shiquan Yang, Ming Zhou, Xiaohua Liu, and Qingsheng Zhu. 2009. Mining Bilingual Data from the Web with Adaptively Learnt Patterns. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP.
2. Philip Resnik. 1999. Mining the Web for Bilingual Text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL).
3. Philip Resnik. 1998. Parallel Strands: A Preliminary Investigation into Mining the Web for Bilingual Text. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas.
4. Xiaoyi Ma and Mark Liberman. 1999. BITS: A Method for Bilingual Text Search over the Web. In Machine Translation Summit VII.
5. Philip Resnik and Noah A. Smith. 2003. The Web as a Parallel Corpus. Computational Linguistics, 29(3).