19. 11. 2015, Brno
Luděk Svozil
Impact of automated translation on mining knowledge from text data
Chapter 1
Page 2
Introduction
• Statistical and hybrid machine translation systems are gaining more attention
• Apart from commercial services like Google Translate and Bing, there are a number of projects aiming to bring the benefits of big data knowledge to end users
Page 3
EU projects on the horizon
• Modern MT – aims to bring a powerful, ready-to-use MT system to desktop users
http://www.modernmt.eu/
• LTI cloud – gathers language technology components for easy use in information systems
http://www.ltinnovate.org/lticloud
Page 4
• If machine translation is part of preprocessing, does it benefit the text-mining process? And how?
• Earlier experiments have shown that when scarce data are combined across different languages, MT greatly simplifies the problem
Page 5
Test data and experiment
• 20 000 reviews in 5 languages from booking.com were translated with Google machine translation and stemmed; a C5.0 decision tree was then trained on them and evaluated using cross-validation
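The evaluation step above can be sketched roughly as follows. This is only an illustration under stated assumptions: a one-level decision tree (decision stump) stands in for C5.0, whitespace tokenization stands in for the real preprocessing, and the toy reviews, labels and function names are invented for the example. It is not the code used in the experiment.

```python
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization.
    return text.lower().split()


def train_stump(docs, labels):
    """Pick the single word whose presence/absence best separates the labels."""
    majority = Counter(labels).most_common(1)[0][0]
    token_sets = [set(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*token_sets))
    best = (None, majority, majority, -1)  # word, label if present, label if absent, hits
    for word in vocab:
        present = [y for t, y in zip(token_sets, labels) if word in t]
        absent = [y for t, y in zip(token_sets, labels) if word not in t]
        label_present = Counter(present).most_common(1)[0][0] if present else majority
        label_absent = Counter(absent).most_common(1)[0][0] if absent else majority
        hits = present.count(label_present) + absent.count(label_absent)
        if hits > best[3]:
            best = (word, label_present, label_absent, hits)
    return best[:3]


def predict(stump, doc):
    word, label_present, label_absent = stump
    return label_present if word in tokenize(doc) else label_absent


def cross_val_error(docs, labels, k=5):
    """Average classification error over k folds (fold f tests items f, f+k, ...)."""
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, len(docs), k))
        train_docs = [d for i, d in enumerate(docs) if i not in test_idx]
        train_labels = [y for i, y in enumerate(labels) if i not in test_idx]
        stump = train_stump(train_docs, train_labels)
        wrong = sum(predict(stump, docs[i]) != labels[i] for i in test_idx)
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / k


# Invented toy reviews, alternating positive and negative.
topics = ["hotel", "staff", "location", "breakfast", "room"]
docs, labels = [], []
for topic in topics:
    docs += ["good " + topic, "bad " + topic]
    labels += ["pos", "neg"]

print(cross_val_error(docs, labels, k=5))  # 0.0 on this trivially separable toy set
```

In a real run the stump would be replaced by a full decision-tree learner and the toy list by the translated and stemmed reviews.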
Page 6
Results – % decrease in attribute count
                            ES     FR     PL     CS     DE
translation                 24%    17%    42%    40%    29%
stemming                    37%    31%    20%    33%    16%
translation and stemming    41%    35%    56%    53%    44%
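How a percentage decrease in attribute count can be obtained is sketched below, assuming that "attributes" are the distinct words of the bag-of-words representation. The suffix-stripping function is a deliberately naive stand-in for a real stemmer, and the sample documents and all names are invented for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    # Hypothetical stand-in for a real stemmer: strip one common English suffix.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def attribute_count(docs, stem=None):
    """Number of distinct (optionally stemmed) words across all documents."""
    vocab = set()
    for doc in docs:
        for word in doc.lower().split():
            vocab.add(stem(word) if stem else word)
    return len(vocab)


def pct_decrease(before, after):
    """Relative reduction of the attribute count, in percent."""
    return 100.0 * (before - after) / before


docs = ["rooms cleaned daily", "room cleaning daily", "clean rooms", "room cleaned"]
before = attribute_count(docs)             # 6 distinct words
after = attribute_count(docs, naive_stem)  # 3 distinct stems
print(pct_decrease(before, after))         # 50.0
```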
Page 7
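Results – avg. classification error
                          ES        FR        PL        CS        DE
Original                  14,10%    14,10%    12,40%    (?)       (?)
Translated                14,10%    13,30%    11,30%    12,70%    12,00%
Stemmed                   15,30%    14,00%    11,90%    11,80%    13,50%
Translated and stemmed    15,50%    15,50%    12,80%    (?)       (?)
(?) The values 14,60%, 12,70%, 13,70% and 14,10% belong to the four cells marked (?); their assignment cannot be recovered from the transcript.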
Page 8
• To observe how well translated data combine with native English, a second experiment was run
• 10 000 English documents were combined with another 10 000 from a different language; the documents in the other language were then translated with Google Translate
Page 9
Results – avg. classification error
                                   EN+FR     EN+PL     EN+DE     EN+ES
original                           16,10%    14,80%    14,60%    17,30%
non-English language translated    33,50%    33,90%    37,70%    36,10%
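The size of the effect is easier to read as a difference in percentage points. A small sketch using only the figures from the table above (the variable and function names are illustrative):

```python
# Average classification error in percent, copied from the table.
original = {"EN+FR": 16.1, "EN+PL": 14.8, "EN+DE": 14.6, "EN+ES": 17.3}
translated = {"EN+FR": 33.5, "EN+PL": 33.9, "EN+DE": 37.7, "EN+ES": 36.1}


def error_increase(before, after):
    """Increase in classification error, in percentage points, per language pair."""
    return {pair: round(after[pair] - before[pair], 1) for pair in before}


increase = error_increase(original, translated)
print(increase)
# {'EN+FR': 17.4, 'EN+PL': 19.1, 'EN+DE': 23.1, 'EN+ES': 18.8}
```

In every pair the error roughly doubles once the non-English half is translated.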
Page 10
Conclusions
• MT simplifies the problem (reduces the dictionary) without increasing classification error
• Care must be taken when combining native and translated documents
Page 11
• Further details, tests, and a comparison of rule-based and MT translators can be found in my bachelor's thesis "Dolování znalostí z vícejazyčných textových dat" ("Mining Knowledge from Multilingual Text Data"), which will be available during January-February 2016