The Web as a Parallel Corpus


A paper by Philip Resnik and Noah A. Smith
(2003, Computational Linguistics)
My interpretation of their research.
http://www.thebritishmuseum.ac.uk/compass/ixbin/goto?id=OBJ67
Contents:

- Introduction to parallel corpora
- The STRAND Web-mining architecture (est. 1999)
- Content-Based Matching
- Exploiting the Internet Archive
- Conclusions and Further Work
Introduction to parallel corpora

- The Rosetta Stone dates from around 196 BC. Its three texts carry the same content in hieroglyphic, Demotic, and Greek script.
- The Canadian Hansard and the Hong Kong Hansard are two other famous parallel corpora, especially because they are available electronically and are of high quality.
- Motivation: bitexts provide indispensable training data for statistical translation models.
- The Web can be mined for suitable bilingual and multilingual texts.
STRAND: Web-Mining Architecture (1)

- Structural Translation Recognition, Acquiring Natural Data (STRAND) is the authors' software for finding pairs of Web pages that are translations of each other.
- Having more parallel text is always an advantage for machine translation research and implementation.
- How does STRAND work?
  1) Locate pages that might have parallel translations, looking for "parent pages" and "sibling pages". The page author has most likely embedded a language link such as "Chinese" or "Arabic" in the page.
  2) Generate candidate pairs that might be translations, checking whether the pairs have the same HTML structure.
  3) Structurally filter out the non-translation candidate pairs by examining the content of the pairs.
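Step 1 above can be sketched in a few lines. This is only an illustration, not STRAND's actual code: the class name and the small LANGUAGE_NAMES set are my own hypothetical choices, using Python's standard html.parser.

```python
# Hypothetical sketch of STRAND step 1: scan a page's anchors for
# link text that names a language, such as "Arabic" or "Français".
# The LANGUAGE_NAMES set is illustrative, not from the paper.
from html.parser import HTMLParser

LANGUAGE_NAMES = {"english", "french", "français", "chinese", "arabic", "español"}

class LanguageLinkFinder(HTMLParser):
    """Collect (language word, href) pairs for anchors whose text names a language."""
    def __init__(self):
        super().__init__()
        self._href = None
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        word = data.strip().lower()
        if self._href and word in LANGUAGE_NAMES:
            self.candidates.append((word, self._href))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

page = '<a href="index_fr.html">Français</a> <a href="about.html">About</a>'
finder = LanguageLinkFinder()
finder.feed(page)
print(finder.candidates)  # [('français', 'index_fr.html')]
```

A real crawler would of course also follow such links and record where they lead, but the anchor-text test is the core heuristic the slide describes.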
STRAND: Web-Mining Architecture (2)

1) Locating pairs: candidate pairs typically come from one Web site. STRAND looks for "sibling" pairs: pages linked to each other by links offering the user "Français", "Español", or other language options.
2) Generating pairs: for many Web sites the URLs can be compared, e.g.
   http://www.ottawa.ca/index_en.html
   http://www.ottawa.ca/index_fr.html
3) Structural filtering: first look at the HTML structure; page authors often use the same or a very similar template. Next, a markup analyzer uses three token types to produce a linear representation of each of the two candidate pages.
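The URL comparison in step 2 can be sketched as follows. This is a simplified illustration under my own assumptions: the small LANG_MARKERS list stands in for the paper's larger set of language-substring substitution patterns.

```python
# Illustrative sketch of STRAND step 2: pair URLs that differ only by a
# language marker, e.g. index_en.html vs index_fr.html.
# LANG_MARKERS is a small assumed sample, not the paper's full pattern set.
LANG_MARKERS = ["_en", "_fr", "-en", "-fr", "/en/", "/fr/"]

def candidate_pairs(urls):
    """Group URLs by the string left after normalizing away language markers."""
    buckets = {}
    for url in urls:
        key = url
        for marker in LANG_MARKERS:
            if marker in key:
                key = key.replace(marker, "/LANG/")
        buckets.setdefault(key, []).append(url)
    # Any bucket holding two or more URLs is a candidate translation pair.
    return [tuple(sorted(b)) for b in buckets.values() if len(b) > 1]

urls = ["http://www.ottawa.ca/index_en.html",
        "http://www.ottawa.ca/index_fr.html",
        "http://www.ottawa.ca/contact.html"]
print(candidate_pairs(urls))
# [('http://www.ottawa.ca/index_en.html', 'http://www.ottawa.ca/index_fr.html')]
```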
STRAND: Web-Mining Architecture (3)

Candidate pair (English and French pages):

<HTML>
<TITLE>City Hall</TITLE>
<BODY>
<H1>Regional Government</H1>
The business.........

<HTML>
<TITLE>Hotel de Ville</TITLE>
<BODY>
Les affaires.........

The candidate pair is now formed into two linear token sequences:

[START:HTML]        [START:HTML]
[START:TITLE]       [START:TITLE]
[Chunk:8]           [Chunk:12]
[END:TITLE]         [END:TITLE]
[START:H1]          [START:BODY]
[Chunk:18]          [Chunk:138]
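The linearization above can be sketched with Python's standard html.parser. This is a minimal reading of the slide, not STRAND's actual analyzer; in particular, exact chunk lengths depend on whitespace handling, so the numbers may differ slightly from the slide's.

```python
# Minimal sketch of the markup linearizer: reduce a page to a sequence of
# [START:tag], [END:tag], and [Chunk:length] tokens, the three token types
# mentioned on the previous slide. Whitespace handling here is an assumption.
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag.upper()}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag.upper()}]")

    def handle_data(self, data):
        text = data.strip()
        if text:  # record only the length of non-markup text chunks
            self.tokens.append(f"[Chunk:{len(text)}]")

lin = Linearizer()
lin.feed("<HTML><TITLE>City Hall</TITLE><BODY><H1>Regional Government</H1>")
print(lin.tokens)
```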
Using these two linear alignments

Four scalar values characterize the quality of the alignment:
- dp (difference percentage) = proportion of alignment mismatches (that is, tokens that don't match)
- n = number of aligned non-markup text chunks
- r = correlation of the lengths of the aligned non-markup chunks
- p = significance level of the correlation r

Next, the analysts can manually set thresholds on these parameters and check the results. 100% precision and 68.6% recall have been obtained using STRAND to find English-French Web pages.
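As a rough illustration, three of the four values might be computed as below. This sketch makes two simplifying assumptions of mine: the two token streams are taken as already aligned position by position (STRAND actually aligns them with a dynamic-programming alignment), and p is omitted, since the significance of r needs a t-test (e.g. scipy.stats.pearsonr).

```python
# Hedged sketch of the alignment-quality values dp, n, and r.
# Markup tokens are strings like '[START:HTML]'; chunk tokens are ints
# giving the chunk length. Assumes a pre-made positional alignment.
def pearson(xs, ys):
    """Plain Pearson correlation, so the sketch needs no third-party library."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / var

def alignment_quality(tokens_a, tokens_b):
    pairs = list(zip(tokens_a, tokens_b))
    chunks = [(a, b) for a, b in pairs if isinstance(a, int) and isinstance(b, int)]
    # dp: proportion of aligned positions whose markup tokens do not match
    dp = sum(a != b for a, b in pairs if isinstance(a, str)) / len(pairs)
    n = len(chunks)  # number of aligned non-markup text chunks
    r = pearson([a for a, _ in chunks], [b for _, b in chunks])
    return dp, n, r

en = ["[START:HTML]", "[START:TITLE]", 9, "[END:TITLE]", "[START:BODY]", 140]
fr = ["[START:HTML]", "[START:TITLE]", 13, "[END:TITLE]", "[START:BODY]", 160]
print(alignment_quality(en, fr))  # dp=0.0, n=2, r=1.0
```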
Optimizing Parameters Using Machine Learning

A ninefold cross-validation experiment using decision-tree induction was used to predict the class assigned by the human judges. The learned classifiers were substantially different from the manually set (heuristic) thresholds:
- Manually set: 31% of good document pairs were discarded.
- ML-set: 16% of good pairs were discarded (4% false positives).

Other Related Work / Other Linguistic Researchers

- Some researchers use the Parallel Text Miner (PTMiner), which uses existing search engines to locate pages that are likely to be in the other language of interest. A final filtering stage then cleans the corpus.
- Bilingual Internet Text Search (BITS) is used by other researchers and employs different matching techniques.
- STRAND, PTMiner, and BITS are all largely independent of linguistic knowledge about particular languages, and are therefore easily ported to new language pairs.
- Resnik has looked into English-Arabic, English-Chinese (Big5), and English-Basque.
Mining the Web

- Researchers can and do mine the Web every day.
- The physicist Albert-László Barabási and his team have studied the size, shape, and structure of the Web, as well as the hit frequencies of numerous Web pages.
- Spiders (crawlers) are used in this research.
- The Internet Archive (www.archive.org/web/researcher/) is also instrumental in obtaining useful data.
The Internet Archive

The Internet Archive is a nonprofit organization attempting to archive the entire publicly available Web, preserving the content and providing free access to researchers, historians, scholars, and the general public (120 terabytes of data in 2002; over 10 billion Web pages).

Properties of the Archive:

1) The Archive is a temporal database, but it is not stored in temporal order.
2) Extracting a document is an expensive operation (text extraction).
3) Computational complexity must be kept low when mining this database.
4) Data relevant for linguistic purposes are clearly available.
5) A suite of tools exists for linguistic processing of the Archive.
Building an English-Arabic Corpus

- Step 1: search for English-Arabic pairs. Look at 24 top-level national domains for countries where Arabic is spoken: Egypt (.eg), Saudi Arabia (.sa), Kuwait (.kw), etc., plus other .com domains believed to be useful to Arabic-speaking people.
- Step 2: Resnik et al. mined two crawls of the Internet Archive comprising 8 TB and 12 TB; the relevant domains held 19,917,923 pages.
- Step 3: only 8,294 pairs of English-Arabic bitexts were found.
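The domain filter in step 1 might look like the sketch below. The TLD set here is a small illustrative sample of my own, not the paper's full list of 24 national domains.

```python
# Illustrative sketch of the step-1 domain filter: keep only URLs whose
# host ends in a country-code TLD where Arabic is widely spoken.
# ARABIC_TLDS is an assumed sample, not the paper's full list.
from urllib.parse import urlparse

ARABIC_TLDS = {".eg", ".sa", ".kw", ".ae", ".jo"}

def in_arabic_domain(url):
    host = urlparse(url).hostname or ""
    return any(host.endswith(tld) for tld in ARABIC_TLDS)

urls = ["http://www.ahram.org.eg/index.html",
        "http://www.example.com/page.html"]
print([u for u in urls if in_arabic_domain(u)])
# ['http://www.ahram.org.eg/index.html']
```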
Conclusions and Further Work

- Initial Web searches for parallel texts were undertaken in 1998; Resnik's report is from 2002. The authors lament the limited range of languages available on the Web, as well as the lack of data made available by some countries.
- The growth of both the Web and the Internet Archive will considerably expand the available parallel corpora.
- Chen and Nie (2000), for example, found around 15,000 English-Chinese document pairs.
- One of the early STRAND projects for English-Chinese parallel texts found over 70,000 pairs.
- Because STRAND expects pages to be very similar in structural terms, the resulting document collections are particularly amenable to sentence- or segment-level alignment.