Transcript 投影片
國立雲林科技大學
National Yunlin University of Science and Technology
A New Lexicon Mechanism for
Chinese Word Segmentation
Advisor : Dr. Hsu
Graduate : Kuo-min Wang
2006 PACIS
1
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outline
Motivation
Objective
Introduction
A New Lexicon Mechanism
Experiments
Conclusion
Personal Opinions
2
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Motivate
Under the development of global networking through
Internet, the amount of articles in Chinese or other
oriental languages is increasing rapidly.
As the lack of explicit separator, word segmentation
is a precondition for the processing of these
character-based languages and thus affecting the
whole system in performance.
3
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Objective
This paper propose a new solution for Chinese word
segmentation problem based on lexicon named
double-character-and-long-world-hash-indexing
(DCLWHI).
This method can improve the speed and efficiency of
word segmentation without extra memory spending,
and gains the same accuracy.
4
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction
The current methods of Chinese word
segmentation are divided into two kinds
Lexicon
Easily accomplished, high level arithmetic efficiency
Out of vocabulary problem (OOV)
(new words, names of people, organizations and locations)
Frequency statistic
Has the advantage on OOV problems
But the arithmetic efficiency is much lower than the
lexicon based method.
5
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
A New Lexicon Mechanism
The Double-Character words hold large proportion in
Chinese words.
70% are double-character word [4]
Make a hash indexing for the first two characters of the lexicon words,
then add the remaining string into a special long word table, which has a
hash indexing.
6
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
A New Lexicon Mechanism
First-Double-Character-Hash-Indexing
Flag Bit(2Bytes)
If the two-character is a prefix of a word which length is N, the big N-1 of the 2
bytes will be set 1;
Exaple “圖籍” , which is a double-character word, but can’t be the prefix of
other words, So the Flag Big of 圖籍 is set 0000000000000010(0x0002)
電老 is not a Chinese word, but it can be a prefix of a word 電老虎. So the Flag
Bit is 0000000000000100(0x0004)
Similar examples : 春夏(ox000A),君子(x0006)、敢作(x0008)
Long Word Hash Indexing
Similar to the First-Double-Character-Hash-Indexing.
0000 0000 0000 0010 2-character
0000 0000 0000 0100 3-character
0000 0000 0000 1000 4-character
7
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
A New Lexicon Mechanism
Example of Search->君子當圖籍是電老虎
Pick up first two characters “君子”, Flag Big is x0006 can be a 2character or a prefix of a Treble-Character
word.
君子當圖籍是電老虎
Then shift to the character “當”, compute the hash value of the substring
“君子當”, search in the long word
君子當圖籍是電老虎
Find the marching index, confirm the string , marching succeed.
Shift to Character “圖籍” (0x002)
君子當圖籍是電老虎
Shift to Character “是電”
君子當圖籍是電老虎
There is no value in hash-indexing, 2 situations may happen
First, there is no value in hash-indexing, return one character “是”
Second, there is a substring in the index, but value unequally; return one character
“是”
Shift to Character電老”(0x004)
Shift to Character “虎”
君子當圖籍是電老虎
君子當圖籍是電老虎
8
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experiments
Comparison of Searching Cycles
Comparison of Memory Space Cost
Comparison of Speed
9
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Background
Binary-Seek-by-Word
Composed of three parts
Lexicon text, word-index-table, first-character-index-table
10
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Background (cont.)
TRIE indexing tree
is a multi-chain-table tree, the mechanism is
composed of two parts:
First-character-index table
and TRIE index-tree node
Didn’t need to predict the length
of the word , only need to match
the word by chain-tree
11
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Background (cont.)
Binary-Seek-by-Characters
Absorbs the search-advantage in TRIE indexing tree,
using searching by characters not searching by words
12
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Background (cont.)
Summary above methods’ drawbacks
Binary-seek-by-word is using full-words
marching, the efficiency is evidently low.
The design and maintenance of the TRIE tree is
very complex, wastes mass memory space
Binary-seek-by characters
Improves some aspects, but it doesn’t change the
data structure of the binary-seek-by-word which
restrict the efficiency.
13
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Some novel schemes
Double-Character-Hashindexing[4]
An new searching tree
improved from the TRIE
indexing tree.
Composed of two parts:
Hashing index, remaining
strings.
Can avoids the deep
searching , increases the
segment speed without
complex increasing.
14
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Some novel schemes (cont.)
A new lexicon mechanism based on PATRICIA[3]
Use of the ISN (internal statement number) of the words as the
key words bit-string,
Constructs the PATRICIA tree by comparing the big-string.
Advantage
The searching process only need some cycles of bit comparison and
some cycles of string comparison.
Double-Array Trie[1]
Even node in the tree stands for a status of an auto-machine,
Which changes according to the difference of the variable.
This new structure actually is an improved scheme of the
TRIE tree, using 2 linear arrays to express the TRIE tree
15
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Some novel schemes (cont.)
用負的base值表示該位置為詞語。
如果狀態i對應某一個詞,而且Base[i]=0,那麼令Base[i]=(-1)*i,
如果Base[i]的值不是0,那麼令Base[i]=(-1)*Base[i]。得到雙陣列如下:
例如設“阿根”的下標為i=8,那麼check[i]的內容是“阿”的下標,
而base[i]是“阿根廷”的下標的基值。
“廷”的序列碼為x=8,那麼“阿根廷”的下標為base[i]+x=base[8]+8=12。
16
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Some novel schemes (cont.)
Double-code scheme [1]
Basic idea is mapping the 6768 Chinese characters
in GB-2312 into the sequence-code from 1 to 6768.
Every string written in Chinese can only maps to a
number string,
Composed of two steps:
Switch from number-sequence into even-coding
Establish indexing mechanism
17
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Analysis of the novel schemes
Double-Character-Hash-Indexing
PATRICIA
Is a super arithmetic in segment speed, but it waste on the memory
space and reduce the efficiency.
Double-Array Trie
Improvement of TRIE index tree, while it is easier structured and
maintained than the former mechanism.
When decrease or increase the lexicon, the whole double-array should
be adjusted.
Double-code scheme
The extract rate of the arithmetic is not good enough, which result in a
very big array, restrict the performance of the search efficiency
18
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experiment Detail
大白
0x00E
大白
大白日
大白日夢
大白日 大白日夢
19
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experiment Detail
20
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Conclusions
Our mechanism DCLWHI farther improves the
speed and efficiency of segmentation.
The scheme A has a very high process speed
but costs too much memory space, while
scheme B costs less storage with a high
efficiency. We think it a good eclectic
mechanism for Chinese word segmentation.
21
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Opinions
Experiments are not enough to evidence this
method is very well.
…..
22
Intelligent Database Systems Lab