extraction of katakana variant pairs


國立雲林科技大學
National Yunlin University of Science and Technology
Web-based acquisition of Japanese katakana variants
Advisor : Dr. Hsu
Reporter : Wen-Hsiang Hu
Authors : Takeshi Masuyama; Hiroshi Nakagawa
2005, SIGIR
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outline
Motivation
Objective
Introduction
ACQUISITION OF STRING PENALTY WITH WEB DATA
EXTRACTION OF KATAKANA VARIANT PAIRS
CONCLUSIONS AND FUTURE WORK
Personal Opinion
Motivation

Previous work relied on manual definitions:
─ Katakana rewrite rules were defined manually,
  e.g. ベ (be) and ヴェ (ve) are replaceable with each other.
─ The weight of each operation that edits one string into another was defined manually to detect these variants,
  e.g. the weight of substituting ベ (be) with ヴェ (ve) is 0.8.
However, this previous work has not been able to keep up with the ever-increasing number of loanwords and their variants.
Objective


─ Acquire new weights of edit operations automatically, and
─ keep up with new Katakana loanwords only by collecting text data from the Web.
ACQUISITION OF STRING PENALTY WITH WEB DATA
Flow of SP acquisition (labels recovered from the figure):
─ Vodka and ウォッカ (wholtuka)
─ Collect candidate Katakana variant pairs (threshold of edit distance : 2):
   (ウォッカ (wholtuka), ウォトカ (wholtoka)),
   (ウォッカ (wholtuka), ウオッカ (uoltuka)),
   (ウォッカ (wholtuka), ヴォッカ (voltuka))
─ Google; threshold: 0.00006; stop-words
─ Calculate the string penalty (SP); CLC : character-level context
─ Extract Katakana variant pairs
e.g.
f(oltuka) = 2
f(oltuka, w ↔ u) = 1
f(oltuka, w ↔ v) = 1
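The candidate-collection step (edit distance threshold of 2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the distance is a plain unit-cost Levenshtein distance over Katakana characters, and the function names and word list are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def candidate_pairs(words, max_distance=2):
    """Unordered pairs of distinct words within the distance threshold."""
    return [(w1, w2) for w1, w2 in combinations(words, 2)
            if 0 < edit_distance(w1, w2) <= max_distance]


# Hypothetical word list: the four vodka spellings plus an unrelated word.
words = ["ウォッカ", "ウォトカ", "ウオッカ", "ヴォッカ", "スパゲッティ"]
for pair in candidate_pairs(words):
    print(pair)
```

With this word list every pair of vodka spellings falls within the threshold, while no pair involving スパゲッティ does.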
EXTRACTION OF KATAKANA VARIANT PAIRS
ミネラルウォーター (mineraruwho-ta- for “mineral water”)
ミネラルウオータ (mineraruuo-ta for “mineral water”)
threshold of string penalty (SP) : 4
We collect Katakana words from the corpus by pattern matching against a Katakana character set.
The set also includes marks, e.g. ・ (“bullet”), ー (“macron-1”), − (“macron-2”), ― (“macron-3”),
so that Katakana words such as ミネラルウォーター (mineraruwho-ta- for “mineral water”) are collected.
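The pattern matching described above can be sketched with a regular expression. This is only an assumed character set: the Katakana letters U+30A1 to U+30FA plus the four marks, where the code points chosen for "macron-2" and "macron-3" are a guess. The function name is hypothetical.

```python
import re

# Assumed Katakana character set (not taken from the paper).
KATAKANA_WORD = re.compile(
    "["
    "\u30A1-\u30FA"  # Katakana letters
    "\u30FB"         # middle dot ("bullet")
    "\u30FC"         # prolonged sound mark ("macron-1")
    "\u2212"         # minus sign (assumed "macron-2")
    "\u2015"         # horizontal bar (assumed "macron-3")
    "]+"
)


def extract_katakana_words(text):
    """Return every maximal run of characters from the assumed set."""
    return KATAKANA_WORD.findall(text)


sample = "毎朝ミネラルウォーターを飲む。グリズリー・ベアも見た。"
print(extract_katakana_words(sample))
```

Because the marks are part of the set, a word written with a middle dot is collected as a single token rather than being split in two.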
─ Extract candidate Katakana variant pairs
─ Extract Katakana variant pairs (threshold of cosine similarity : 0.05)
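The cosine-similarity filter can be sketched as below. The slides do not say how the context vectors are built, so this sketch assumes simple term-count vectors; the function names, the toy contexts, and the choice of `>=` for the threshold test are all assumptions.

```python
import math
from collections import Counter


def cosine_similarity(ctx_a, ctx_b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(ctx_a[t] * ctx_b[t] for t in ctx_a.keys() & ctx_b.keys())
    norm_a = math.sqrt(sum(v * v for v in ctx_a.values()))
    norm_b = math.sqrt(sum(v * v for v in ctx_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_variant_pair(ctx_a, ctx_b, threshold=0.05):
    """Keep a candidate pair only if its contexts are similar enough."""
    return cosine_similarity(ctx_a, ctx_b) >= threshold


# Toy contexts, invented for illustration only.
ctx_1 = Counter({"水": 2, "飲む": 1})
ctx_2 = Counter({"水": 1, "ボトル": 1})
ctx_3 = Counter({"熊": 3})
print(is_variant_pair(ctx_1, ctx_2), is_variant_pair(ctx_1, ctx_3))
```

Two words whose contexts share no term get a similarity of 0 and are rejected, however close their spellings are.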
Experiment
We conducted a paired t-test (significance level: 5%)
for the cases of SP = 1, 2, and 3, and no significant
difference was detected.
Introduction

The pronunciation of loanwords does not necessarily coincide with that in their original language.
Introduction (cont.)

We counted how many documents Google retrieved when each Katakana variant of “spaghetti” was used as a query.
Introduction (cont.)

We will first describe methods based on rewrite rules, which
are described in Table 3. Henceforth, ↔ denotes substitution,
∅ denotes an empty string,…

For example, when ベネチア (benechia for “Venezia”) was input into their
system, which applies rewrite rules, the following variants were obtained:
─ ベネツィア (benetsia)
─ ヴェネチア (venechia)
─ ヴェネツィア (venetsia)
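How substitution rules produce these variants can be sketched as follows. The rule set here is hypothetical: it contains only the two substitutions implied by the ベネチア example (ベ ↔ ヴェ and チ ↔ ツィ), not the actual rules of Table 3, and the function is an illustration rather than the cited system.

```python
from itertools import product

# Hypothetical rules inferred from the example above, not Table 3 itself.
RULES = [("ベ", "ヴェ"), ("チ", "ツィ")]


def generate_variants(word, rules=RULES):
    """Apply each bidirectional rule (a <-> b) wherever it matches."""
    segments = []  # one list of alternative spellings per matched unit
    i = 0
    while i < len(word):
        for a, b in rules:
            if word.startswith(a, i):
                segments.append([a, b])
                i += len(a)
                break
            if word.startswith(b, i):
                segments.append([b, a])
                i += len(b)
                break
        else:
            # No rule matches at this position: keep the character as is.
            segments.append([word[i]])
            i += 1
    variants = {"".join(choice) for choice in product(*segments)}
    variants.discard(word)  # the input itself is not a new variant
    return variants


print(sorted(generate_variants("ベネチア")))
```

Each new loanword pattern needs another hand-written entry in `RULES`, which is the maintenance cost the paper aims to avoid.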
Introduction (cont.)


Previous methods have difficulty keeping up with the ever-increasing
number of loanwords and their variants, since they define rewrite rules
manually or assign weights to the edit distance manually.
We propose a method of mechanically determining
the weights of the string penalty to overcome this
problem.
Calculation of a string penalty

We used the following five types of character-level
context (CLC) for each character targeted by the edit
operation:
─ the preceding two characters of the target character,
─ the preceding character of the target character,
─ the succeeding two characters of the target character,
─ the succeeding character of the target character, and
─ the preceding character and the succeeding character of the target character.
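The five context types listed above can be sketched as a small helper. The function name and dictionary keys are hypothetical, and truncating the context at word boundaries is an assumption; the slides do not say how boundaries are handled.

```python
def character_level_contexts(word, i):
    """Return the five CLC types for the character at index i of word."""
    prev1 = word[max(i - 1, 0):i]
    next1 = word[i + 1:i + 2]
    return {
        "prev2": word[max(i - 2, 0):i],  # preceding two characters
        "prev1": prev1,                  # preceding character
        "next2": word[i + 1:i + 3],      # succeeding two characters
        "next1": next1,                  # succeeding character
        "prev1_next1": (prev1, next1),   # preceding and succeeding character
    }


# Contexts of ッ (index 2) in ウォッカ.
print(character_level_contexts("ウォッカ", 2))
```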
Experimental evaluation of a string penalty

Table 6: Correlation of the mechanically determined SP and the manually determined SP.

Cov(X, Y) = E(XY) - E(X)E(Y)

We calculated the correlation coefficient for Table 6 and the value was 0.76, which indicates a strong correlation.
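The correlation coefficient follows from the covariance formula as Cov(X, Y) divided by the product of the standard deviations of X and Y. A minimal sketch, using toy values that are not the data of Table 6 (the variable and function names are hypothetical):

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    # Cov(X, Y) = E(XY) - E(X)E(Y)
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


def correlation(xs, ys):
    # Cov(X, X) is the variance of X, so the denominator is sd(X) * sd(Y).
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Toy values for illustration only.
mechanical_sp = [1.0, 2.0, 3.0, 4.0]
manual_sp = [1.0, 3.0, 2.0, 4.0]
print(correlation(mechanical_sp, manual_sp))
```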
Experimental evaluation of
Katakana variant pairs (cont.)
Comparative results for task of
detecting Katakana variants

Table 10 compares the results for Mechanical, Word,
Google, and Yahoo! in terms of detecting Katakana
variants of “spaghetti.”
Error Analyses

Mechanical could not extract the variant pair
グリズリーベア (gurizuri-bea) and グリズリー・ベア (gurizuri-・bea),
both of which denote “grizzly bear,” since their document-level contexts were completely different.
CONCLUSIONS AND FUTURE WORK

─ We proposed a method of mechanically determining the weight of each edit operation for identifying Katakana variants, based on Web data.
─ Unlike methods presented in previous work, ours could easily keep up with the increasing number of loanwords.
─ We also proposed a method of extracting Japanese Katakana variant pairs from a large corpus based on similarities in spelling and context.
─ In our future work, we are planning to calculate SP with a list of words in other languages and Katakana loanwords.
Personal Opinion

Strength
─ automatic method
Application
─ transliteration variants in Chinese, e.g. 柯林頓, 科林頓, 克林頓 (“Clinton”)