Automatic conversion of a Japanese text corpus into f-structure
OYA, Masanori (大矢 政徳)
National Center for Language Technology, School of Computing, Dublin City University
NCLT Seminar series, 29/11/06
1. Overview
Japanese grammar
Kyoto Text Corpus (KTC)
Converting KTC into dependency trees
Converting KTC into f-structure
Problems
Evaluation
Summary
2. Japanese grammar
Writing system
Syntax
– SOV as the basic word order
– Use of particles for grammatical functions
– Tense, aspect and mood are specified by verbal or adjectival morphology
– “bunsetsu” (sentential units)
– Ellipses of core arguments
– Topicalization
– Two types of relative clauses
– Case particles derived from verbs
– Adverbial nouns
– Coordination
Writing system: three different types of scripts
– Chinese characters (1,945 or more)
  Nouns (which can also be written in hiragana or katakana)
  Stems of verbs and adjectives
– Hiragana (104)
  Inflections of verbs and adjectives
  Particles
  Words that have no Chinese counterparts
– Katakana (124)
  Nouns borrowed from foreign languages
  Technical and scientific names
  Onomatopoeia
– No spaces are given between words
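The division of labour among the three scripts can be observed mechanically. The following is a minimal Python sketch that assumes Unicode block ranges are an adequate proxy for script; the function names `script_of` and `script_runs` are illustrative only and are not part of JUMAN or KNP.

```python
def script_of(ch: str) -> str:
    """Classify one character by Unicode block (a rough proxy for script)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:  # basic CJK Unified Ideographs block only
        return "kanji"
    return "other"


def script_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters of the same script into runs."""
    runs: list[tuple[str, str]] = []
    for ch in text:
        kind = script_of(ch)
        if runs and runs[-1][0] == kind:
            runs[-1] = (kind, runs[-1][1] + ch)
        else:
            runs.append((kind, ch))
    return runs


for kind, run in script_runs("太郎はダブリンの大学に行った。"):
    print(kind, run)
```

Script changes are only a hint for word boundaries: a kanji stem followed by hiragana inflection (行 + った) is one word, which is why a dictionary-based analyzer is needed.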
The chart of Hiragana
(vowels): あ a, い i, う u, え e, お o
k: か ka, き ki, く ku, け ke, こ ko; きゃ kya, きゅ kyu, きょ kyo
g: が ga, ぎ gi, ぐ gu, げ ge, ご go; ぎゃ gya, ぎゅ gyu, ぎょ gyo
s: さ sa, し shi, す su, せ se, そ so; しゃ sha, しゅ shu, しょ sho
z: ざ za, じ ji, ず zu, ぜ ze, ぞ zo; じゃ ja, じゅ ju, じょ jo
t: た ta, ち chi, つ tsu, て te, と to; ちゃ cha, ちゅ chu, ちょ cho
d: だ da, ぢ ji, づ zu, で de, ど do
n: な na, に ni, ぬ nu, ね ne, の no; にゃ nya, にゅ nyu, にょ nyo
h: は ha, ひ hi, ふ fu, へ he, ほ ho; ひゃ hya, ひゅ hyu, ひょ hyo
b: ば ba, び bi, ぶ bu, べ be, ぼ bo; びゃ bya, びゅ byu, びょ byo
p: ぱ pa, ぴ pi, ぷ pu, ぺ pe, ぽ po; ぴゃ pya, ぴゅ pyu, ぴょ pyo
m: ま ma, み mi, む mu, め me, も mo; みゃ mya, みゅ myu, みょ myo
y: や ya, ゆ yu, よ yo
r: ら ra, り ri, る ru, れ re, ろ ro; りゃ rya, りゅ ryu, りょ ryo
w: わ wa, を o
n’: ん

The chart of Katakana
(vowels): ア a, イ i, ウ u, エ e, オ o
k: カ ka, キ ki, ク ku, ケ ke, コ ko; キャ kya, キュ kyu, キョ kyo
g: ガ ga, ギ gi, グ gu, ゲ ge, ゴ go; ギャ gya, ギュ gyu, ギョ gyo
s: サ sa, シ si, ス su, セ se, ソ so
sh: シャ sha, シ shi, シュ shu, シェ she, ショ sho
z: ザ za, ジ zi, ズ zu, ゼ ze, ゾ zo
j: ジャ ja, ジ ji, ジュ ju, ジェ je, ジョ jo
t: タ ta, ティ ti, ツ tsu, テ te, ト to
ch: チャ cha, チ chi, チュ chu, チェ che, チョ cho
d: ダ da, ディ di, デュ du, デ de, ド do
n: ナ na, ニ ni, ヌ nu, ネ ne, ノ no; ニャ nya, ニュ nyu, ニョ nyo
h: ハ ha, ヒ hi, フ hu, ヘ he, ホ ho; ヒャ hya, ヒュ hyu, ヒョ hyo
f: ファ fa, フィ fi, フ fu, フェ fe, フォ fo
b: バ ba, ビ bi, ブ bu, ベ be, ボ bo; ビャ bya, ビュ byu, ビョ byo
v: ヴァ va, ヴィ vi, ヴ vu, ヴェ ve, ヴォ vo
p: パ pa, ピ pi, プ pu, ペ pe, ポ po; ピャ pya, ピュ pyu, ピョ pyo
m: マ ma, ミ mi, ム mu, メ me, モ mo; ミャ mya, ミュ myu, ミョ myo
y: ヤ ya, ユ yu, ヨ yo
r: ラ ra, リ ri, ル ru, レ re, ロ ro; リャ rya, リュ ryu, リョ ryo
w: ワ wa, ウィ wi, ウ wu, ウェ we, ウォ wo; ヲ o
n: ン
SOV as the basic word order; scrambling is prevalent
Use of particles for grammatical functions
Example:
太郎はダブリンの大学に行った。
Taro-wa dabulin-no daigaku-ni it-ta
Taro-TOP Dublin-in college-to go-PST
“Taro went to a college in Dublin.”
– “-wa”, “-ga”, “-wo” and “-ni”: used for core grammatical functions
– Other particles: used for adjuncts (postpositional phrases or complementizers) (Tsujimura 2006)
– The particle “-ni” is ambiguous; it can be used as the OBL case marker or as a postposition for temporal or locative adverbials (a semantic distinction is possible).
Tense, aspect and mood are specified by verbal or adjectival morphology
Example:
太郎はダブリンの大学に行っている。
Taro-wa dabulin-no daigaku-ni it-teiru
Taro-TOP Dublin-in college-to go-PROG.PRES
“Taro is going to a college in Dublin.”
太郎はダブリンの大学に行ったのだろうか。
Taro-wa dabulin-no daigaku-ni it-ta-nodarou-ka
Taro-TOP Dublin-in college-to go-PST-AUX-INT
“(I wonder) whether Taro went to a college in Dublin.” etc.
“Bunsetsu”, or syntactic units
– One bunsetsu = a content word + a particle or inflection ≈ Chinese characters + hiragana or katakana
Example:
太郎はダブリンの大学に行っている。
Taro-wa dabulin-no daigaku-ni it-teiru
Unit 0: Taro-wa, Unit 1: dabulin-no, Unit 2: daigaku-ni, Unit 3: it-teiru
• Spaces represent bunsetsu boundaries.
• Hyphens represent morphological boundaries within a bunsetsu.
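The notation used in the romanized examples (spaces between bunsetsu, hyphens between morphemes) is simple enough to read programmatically. A small sketch, in which the helper name `split_bunsetsu` is an illustrative assumption:

```python
def split_bunsetsu(romanized: str) -> list[list[str]]:
    """Split a romanized example into bunsetsu (on spaces) and morphemes (on hyphens)."""
    return [unit.split("-") for unit in romanized.split()]


units = split_bunsetsu("Taro-wa dabulin-no daigaku-ni it-teiru")
for index, morphemes in enumerate(units):
    # The first morpheme is the content word; the rest are particles or inflection.
    print(f"Unit {index}: content={morphemes[0]!r}, rest={morphemes[1:]}")
```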
Ellipses of core arguments
– Pro-drop: contextually evident units are absent from the sentence
– Gender, person and number of the subject are not specified by verbal or adjectival morphology
Example:
ダブリンの大学に行った。
dabulin-no daigaku-ni it-ta
Dublin-in college-OBL go-PST
“I/We/You/He/She/It/They/(Someone in the context) went to a college in Dublin.”
– Personal pronouns are also available, but they are not equivalent to personal pronouns in English (e.g., variants of the 1st person singular pronoun: ‘ore’, ‘atashi’, ‘boku’, ‘watashi’, ‘watakushi’, etc.; variants of the 2nd person singular pronoun: ‘kimi’, ‘anata’, ‘anta’, ‘omae’, etc.)
Topicalization
– Topicalized units have the particle “wa”
– Non-topicalized units are the focus of the sentence
Example:
ダブリンの大学には太郎が行った。
dabulin-no daigaku-ni-wa Taro-ga it-ta
Dublin-in college-OBL-TOP Taro-NOM go-PST
“To a college in Dublin, Taro went.” or “It is Taro who went to a college in Dublin.”
Relative clauses
– If a clause ends with a verb in a sentence-ending form and comes before a noun, then the clause is a relative clause; Japanese has no relative pronouns.
Example:
私が行った大学
watashi-ga it-ta daigaku
1sg-NOM go-PST college
“the college I went to”
Two types of “relative clauses”: “inner” relative clauses (true relative clauses) and “outer” relative clauses (appositions) (Teramura 1991)
Example:
– 私が答えを見つけた証拠
  watashi-ga kotae-wo mitsuketa shoko
  1sg-NOM answer-ACC find-PST evidence
  “The evidence that I found out the answer” (“outer”)
– 私が見つけた証拠
  watashi-ga mitsuketa shoko
  1sg-NOM find-PST evidence
  “The evidence that I found out ∅” (“inner”: ∅ = evidence)
  “The evidence that I found out PRO” (“outer”: PRO ≠ evidence; something evident in the context)
– If one of the core arguments is in ellipsis, then it is difficult to distinguish a true relative clause from an apposition.
Particles derived from verbs:
– Some case particles are derived from verbs; case particles of this type have a fixed meaning (Masuoka and Takubo 1992)
Example:
ついて tsuite “about” (same form as the adverbial form of the verb つく “attach”)
私は計算言語学について話した。
Watashi-wa keisangengogaku-ni-tsuite hanashi-ta
I-TOP computational linguistics-OBL-about talk-PST
“I talked about computational linguistics.”
Adverbial nouns
– They function as the head of an adverbial phrase with a complement (Masuoka and Takubo 1992)
Example:
ダブリンの大学に通っている時、津波が日本を襲った。
Dabulin-no daigaku-ni kayotteiru toki, tsunami-ga nihon-wo osot-ta.
Dublin-in college-OBL go-PROG time, tsunami-NOM Japan-ACC strike-PST
“When I was studying at a college in Dublin, a tsunami struck Japan.”
– It is also difficult to distinguish the complements in these cases from relative clauses; neither syntactic nor morphological clues are available.
Coordination
– The first coordinated bunsetsu usually (though not necessarily) has the particle “to”, and it is dependent on the next coordinated bunsetsu.
Example:
ダブリンの大学に通っている時、地震と津波が日本を襲った。
Dabulin-no daigaku-ni kayotteiru toki, jishin-to tsunami-ga nihon-wo osot-ta.
Dublin-in college-OBL go-PROG time, earthquake-AND tsunami-NOM Japan-ACC strike-PST
“When I was studying at a college in Dublin, an earthquake and a tsunami struck Japan.”
– Only the last coordinated bunsetsu has the particle which specifies its grammatical function.
3. Kyoto Text Corpus (KTC)
An automatically parsed text corpus of a newspaper (Mainichi Shimbun)
All articles from the 1st to the 17th of January, 1995 (19,687 sentences, 518,687 tokens) and the editorials from January to December, 1995 (18,708 sentences, 453,995 tokens).
Developed by Sadao Kurohashi and Makoto Nagao at Kyoto University, using JUMAN and KNP
All the texts are automatically annotated with morphological tags by JUMAN (Kurohashi and Nagao 1994) (“juman” means 100,000)
The output of JUMAN is parsed by KNP (Kurohashi and Nagao 1994) based on the dependency among “bunsetsu”, and corrected manually
No syntactic CFG category tags are annotated
The valency of verbal predicates is not annotated
JUMAN: morphological analyzer for Japanese based on bigram information
Least-cost path method (Kurohashi and Kawahara 1992)
– Costs are assigned to each morpheme and to each pair of adjacent morphemes in a sentence
– The lower the morpheme frequency, or the lower the frequency of pairs of morphemes, the higher the cost
– If a sentence has several possible analyses, JUMAN sums up the costs and takes the least-cost analysis as the most plausible analysis for the sentence
Accuracy: around 99.0% (comparison of automatic analysis and manually corrected analysis of 10,000 sentences)
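The least-cost path idea can be illustrated with a toy lattice search. This is only a sketch under stated assumptions: the dictionary `WORD_COST`, the bigram table `PAIR_COST`, the default cost and all the numbers are invented for illustration and do not come from JUMAN's actual dictionary or cost tables.

```python
# Hypothetical toy costs (lower = more frequent); not JUMAN's real values.
WORD_COST = {"太郎": 3, "は": 1, "大学": 2, "大": 4, "学": 4, "に": 1, "行った": 2}
PAIR_COST = {("太郎", "は"): 1, ("は", "大学"): 1, ("大学", "に"): 1, ("に", "行った"): 1}
DEFAULT_PAIR_COST = 5  # assumed cost for a pair of morphemes not in the table
BOS = "<BOS>"          # beginning-of-sentence marker


def least_cost_segmentation(text: str) -> tuple[int, list[str]]:
    """Return (total cost, morphemes) of the cheapest analysis covering `text`."""
    # best[i] maps the last morpheme of a path ending at position i -> (cost, path)
    best: list[dict[str, tuple[int, list[str]]]] = [{} for _ in range(len(text) + 1)]
    best[0][BOS] = (0, [])
    for start in range(len(text)):
        for prev, (cost, path) in best[start].items():
            for word, word_cost in WORD_COST.items():
                if not text.startswith(word, start):
                    continue
                pair = 0 if prev == BOS else PAIR_COST.get((prev, word), DEFAULT_PAIR_COST)
                total = cost + word_cost + pair
                end = start + len(word)
                if word not in best[end] or total < best[end][word][0]:
                    best[end][word] = (total, path + [word])
    if not best[-1]:
        raise ValueError("no analysis covers the whole sentence")
    return min(best[-1].values(), key=lambda item: item[0])


print(least_cost_segmentation("太郎は大学に行った"))
```

With these toy costs the analysis 太郎 / は / 大学 / に / 行った wins over the competing split 大 / 学, because both the single-morpheme costs and the pair costs of the latter are higher.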
An example of the output of JUMAN:
太郎は大学に行った。 “Taro went to a college.”
Taro wa daigaku ni itta.
Taro TOP college OBL went
#S-ID: 950101001-001
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
行った itta iku Verb * ConsonantVerb Past
。* mark period * *
EOS
KNP: dependency structure analyzer based on “bunsetsu”
KNP converts the output of JUMAN into a string of bunsetsu.
Accuracy: 90% (comparison of automatic analysis and manually corrected analysis of 10,000 sentences) (Kurohashi and Nagao 1998)
太郎は大学に行った。 “Taro went to a college.”
Taro wa daigaku ni itta.
Taro TOP college OBL went
#S-ID: 950101001-001
* 0 2D
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
* 1 2D
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
* 2 -1D
行った itta iku Verb * ConsonantVerb Past
。* mark period * *
EOS
Unit 0 = 太郎は, Unit 1 = 大学に, Unit 2 = 行った。
大学に太郎は行った。 “Taro went to a college.”
daigaku ni Taro wa itta.
college OBL Taro TOP went
#S-ID: 950101001-001
* 0 2D
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
* 1 2D
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
* 2 -1D
行った itta iku Verb * ConsonantVerb Past
。* mark period * *
EOS
Unit 0 = 大学に, Unit 1 = 太郎は, Unit 2 = 行った。
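The annotated listings shown in this section can be read into a simple data structure. The sketch below assumes the simplified, English-tagged layout used on these slides (a `* <unit> <head>D` header line followed by one morpheme per line); the real corpus files may differ in detail, and `parse_sentence` is an illustrative name, not a KNP function.

```python
import re

# Header of a bunsetsu: "* <unit id> <head id>D"; spacing after "*" varies on the slides.
HEADER = re.compile(r"^\*\s*(\d+)\s+(-?\d+)D")

SAMPLE = """\
#S-ID: 950101001-001
* 0 2D
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
*1 2D
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
*2 -1D
行った itta iku Verb * ConsonantVerb Past
。* mark period * *
EOS
"""


def parse_sentence(text: str) -> list[dict]:
    """Return one dict per bunsetsu: its id, the id of its head (-1 = root), its morphemes."""
    units: list[dict] = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#") or line == "EOS":
            continue
        match = HEADER.match(line)
        if match:
            units.append({"id": int(match.group(1)), "head": int(match.group(2)), "morphemes": []})
        elif units:
            units[-1]["morphemes"].append(line.split())
    return units


for unit in parse_sentence(SAMPLE):
    surface = "".join(m[0] for m in unit["morphemes"])
    print(unit["id"], "->", unit["head"], surface)
```

Both Unit 0 and Unit 1 point to Unit 2, and the head value -1 marks Unit 2 as the root; the scrambled sentence yields the same dependencies with the first two units swapped.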
4. Converting KTC into dependency trees
Motivation:
– LFG-based automatic grammar induction for Japanese; GramLab: Treebank-based Acquisition of Multilingual Resources (Cahill et al. 2002, etc.)
Related work:
– Japanese XLE at Fuji Xerox (Masuichi et al. 2006, etc.)
– PCFG-based automatic grammar induction from a Japanese corpus (Tokunaga et al. 2005, etc.)
– Case frame induction from a Japanese corpus (Kurohashi et al. 2006, etc.)
Procedure:
Text corpus → Dependency trees → F-structures
– At least one syntactic category is annotated on each “bunsetsu” in a sentence.
– All “bunsetsu” in a sentence are integrated into a dependency tree of the sentence.
太郎は大学に行った。 “Taro went to college.”
Taro wa daigaku ni itta.
Taro TOP college OBL went
#S-ID: 950101001-001
* 0 2D
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
* 1 2D
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
* 2 -1D
行った itta iku Verb * ConsonantVerb Past
。* mark period * *
EOS
Unit 0: TopP (Topic); Unit 1: NP (OBL); Unit 2: V
[Figure: dependency tree of the sentence, in which “Taro wa” and “daigaku ni” each depend on “itta 。”]
5. Converting KTC into f-structures
Motivation:
– Are syntactic categories necessary for Japanese?
  Word order is (relatively) free.
  The type (or absence) of particles in each unit specifies its grammatical function (e.g., if a noun has the particle “wo”, then it is an object).
  Verbal morphology specifies the grammatical function of each clause (though not always unambiguously).
Generating f-structure equations directly from the corpus
Text corpus → F-structures
– F-structure equations are directly generated from each unit.
– All the units are unified into the f-structure of the sentence according to the dependency.
太郎は大学に行った。 “Taro went to college.”
Taro wa daigaku ni itta.
Taro TOP college OBL went
#S-ID: 950101001-001
* 0 2D
太郎 tarou * Noun Name * *
は wa * Particle AdverbialParticle * *
* 1 2D
大学 daigaku * Noun NormalNoun * *
に ni * Particle CaseParticle * *
* 2 -1D
行った itta iku Verb * ConsonantVerb Past
。* mark period * *
EOS
Unit 0 = f0 (Topic), Unit 1 = f1 (OBL), Unit 2 = f2
Functional equations from the corpus:
F2:pred = '行く',
F2:tns = 'pst',
F2:stmt = 'decl',
F2:style = 'plain',
F0:pred = '太郎',
F0:prtav = 'は',
F0 elm F2:topic,
F1:pred = '大学',
F1:case = 'に',
F2:obl = F1.
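How equations of this kind might be produced from the annotated units can be sketched as follows. The rule table `CASE_TO_GF`, the tuple layout of `UNITS` and the function `equations` are assumptions made for illustration; they are not the rules of the method presented here, and the `stmt` and `style` equations are left out for brevity.

```python
# Each unit: (id, head id, morphemes); each morpheme: (surface, lemma, pos, subpos, tense).
UNITS = [
    (0, 2, [("太郎", "太郎", "Noun", "Name", None),
            ("は", "は", "Particle", "AdverbialParticle", None)]),
    (1, 2, [("大学", "大学", "Noun", "NormalNoun", None),
            ("に", "に", "Particle", "CaseParticle", None)]),
    (2, -1, [("行った", "行く", "Verb", None, "Past")]),
]

# Hypothetical mapping from case particles to grammatical functions.
CASE_TO_GF = {"が": "subj", "を": "obj", "に": "obl"}


def equations(units) -> list[str]:
    """Generate functional equations unit by unit, using only unit-internal features."""
    eqs: list[str] = []
    for uid, head, morphemes in units:
        f = f"F{uid}"
        _, lemma, pos, _, tense = morphemes[0]  # the content word of the unit
        eqs.append(f"{f}:pred = '{lemma}'")
        if pos == "Verb" and tense == "Past":
            eqs.append(f"{f}:tns = 'pst'")
        for surface, _, _, subpos, _ in morphemes[1:]:
            if subpos == "AdverbialParticle":
                eqs.append(f"{f}:prtav = '{surface}'")
                if surface == "は" and head >= 0:
                    eqs.append(f"{f} elm F{head}:topic")
            elif subpos == "CaseParticle":
                eqs.append(f"{f}:case = '{surface}'")
                if head >= 0 and surface in CASE_TO_GF:
                    eqs.append(f"F{head}:{CASE_TO_GF[surface]} = {f}")
    return eqs


print(",\n".join(equations(UNITS)) + ".")
```

The output lists the same kinds of equations as above, in unit order rather than head-first order.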
F-structure from the functional equations:
pred : '行く'
tns : pst
stmt : decl
style : plain
topic : 1 : pred : '太郎'
            prtav : 'は'
obl : pred : '大学'
      case : 'に'
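The step from functional equations to an f-structure can be sketched as a small resolver. This is an illustrative simplification, not the implementation used for the talk: it handles only the three equation shapes that appear in the example, and its only consistency check is a clash between two different atomic values.

```python
import re

EQUATIONS = [
    "F2:pred = '行く'", "F2:tns = 'pst'", "F2:stmt = 'decl'", "F2:style = 'plain'",
    "F0:pred = '太郎'", "F0:prtav = 'は'", "F0 elm F2:topic",
    "F1:pred = '大学'", "F1:case = 'に'", "F2:obl = F1",
]


def build_fstructures(equations: list[str]) -> dict[str, dict]:
    """Resolve equations into nested dicts, keyed by f-structure variable."""
    fs: dict[str, dict] = {}

    def node(name: str) -> dict:
        return fs.setdefault(name, {})

    for eq in equations:
        if m := re.fullmatch(r"(F\d+):(\w+) = '(.*)'", eq):      # atomic value
            target = node(m[1])
            if target.get(m[2], m[3]) != m[3]:
                raise ValueError(f"clash on {m[1]}:{m[2]}")
            target[m[2]] = m[3]
        elif m := re.fullmatch(r"(F\d+) elm (F\d+):(\w+)", eq):  # set membership
            node(m[2]).setdefault(m[3], []).append(node(m[1]))
        elif m := re.fullmatch(r"(F\d+):(\w+) = (F\d+)", eq):    # embedded f-structure
            node(m[1])[m[2]] = node(m[3])
        else:
            raise ValueError(f"unrecognised equation: {eq}")
    return fs


print(build_fstructures(EQUATIONS)["F2"])
```

Because embedded f-structures are shared by reference, the order in which the equations arrive does not affect the result.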
5. Problems
This method of “generating f-structure equations directly from the corpus” does not always work well.
– Core argument ellipses
– Two types of relative clauses
– Particles derived from verbs
– Adverbial nouns
– Coordination
The context among units must be taken into consideration to make the generation more accurate.
Ellipses of core arguments
– Contextually-evident units are absent from the sentence
– Gender, person and number of the subject are not specified by verbal or adjectival morphology
Example:
ダブリンの大学に行った。
dabulin-no daigaku-ni it-ta
Dublin-in college-OBL go-PST
“He/She/They went to a college in Dublin.”
Core argument ellipses
– KTC does not annotate missing elements.
– No equations for missing elements can be generated from KTC.
– For an f-structure with ellipses, “PRO” must be added to make the f-structure complete.
– If a predicate has no subject in the clause, then an equation for the subject is added.
– If a transitive verb has no object, then an equation for the object must be added …
– However, KTC does not annotate the valency of verbal predicates, so it is impossible to tell which verbs are transitive on the basis of the annotated information alone.
– Case frames are required to detect missing objects of transitive verbs.
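The completion step for elided core arguments can be sketched as follows. The set `TRANSITIVE` is a hypothetical stand-in for a case-frame lexicon, which the corpus itself does not provide, and `complete_with_pro` is an illustrative name.

```python
# Hypothetical stand-in for a case-frame lexicon: predicates that require an object.
TRANSITIVE = {"見つける", "襲う"}


def complete_with_pro(fs: dict) -> dict:
    """Return a copy of a clause-level f-structure with PRO for missing core arguments."""
    completed = dict(fs)
    if "subj" not in completed:
        completed["subj"] = {"pred": "PRO"}
    # Without valency information this second step is impossible; here it is
    # driven by the assumed TRANSITIVE set.
    if completed.get("pred") in TRANSITIVE and "obj" not in completed:
        completed["obj"] = {"pred": "PRO"}
    return completed


# ダブリンの大学に行った。 (no overt subject)
print(complete_with_pro({"pred": "行く", "tns": "pst", "obl": {"pred": "大学", "case": "に"}}))
```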
Two types of relative clause
– Features in one bunsetsu are not enough to distinguish the two types.
– A probabilistic model for analysing them (Abekawa and Okumura 2005) employs the co-occurrence probability of head nouns and verbal predicates in “outer” relative clauses.
– This method is expected to be applicable to the present method in future.
Case particles derived from verbs
– Case particles of this type are analyzed by KNP as verbs, not as case particles.
– Bunsetsus with them are analyzed as sentential adjuncts, not as postpositional adjuncts or as complements (in the case of “という”).
– The equations must be revised accordingly.
Adverbial nouns
– Features in one bunsetsu are not enough to distinguish the complements of adverbial nouns from relative clauses.
– If a clause is dependent on an adverbial noun and is analyzed as a relative clause, then the equation of the clause must be replaced by that of a complement.
Coordination
– Only the last coordinated bunsetsu has the particle which specifies its grammatical function; the other coordinated bunsetsus cannot be analyzed as having appropriate grammatical functions.
– The last coordinated bunsetsu does not have any feature within it that marks it as a coordinate; the bunsetsu context must be taken into consideration in order to convert it properly into f-structure equations.
– Dependency among coordinated bunsetsus must also be reanalyzed.
Coordination
Dependency among coordinated bunsetsus must be reanalyzed.
Example:
– jishin-to tsunami-ga is reanalyzed as [jishin-to tsunami] -ga
The coordinated bunsetsus are the elements of a new unit, which constitutes a new bunsetsu with the case particle (“ga” in this example).
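The reanalysis can be sketched as a pass over the units that collects a chain of coordinated bunsetsus into one new unit carrying the last conjunct's particle. The dictionary layout and the function `group_coordination` are assumptions made for illustration; the test "particle is to and head is the next unit" is a deliberate simplification, since “to” has other uses, and the sketch does not renumber units.

```python
def group_coordination(units: list[dict], coord_particle: str = "to") -> list[dict]:
    """Merge runs of coordinated units into a single unit with a list of conjuncts."""
    result: list[dict] = []
    pending: list[str] = []  # conjuncts collected so far
    for unit in units:
        is_conjunct = unit["particle"] == coord_particle and unit["head"] == unit["id"] + 1
        if is_conjunct:
            pending.append(unit["word"])
        elif pending:
            # The last conjunct supplies the id, the head and the case particle.
            result.append({"id": unit["id"], "head": unit["head"],
                           "conjuncts": pending + [unit["word"]],
                           "particle": unit["particle"]})
            pending = []
        else:
            result.append(unit)
    if pending:
        raise ValueError("coordination without a final conjunct")
    return result


SENTENCE = [
    {"id": 0, "head": 1, "word": "jishin", "particle": "to"},
    {"id": 1, "head": 3, "word": "tsunami", "particle": "ga"},
    {"id": 2, "head": 3, "word": "nihon", "particle": "wo"},
    {"id": 3, "head": -1, "word": "osot-ta", "particle": None},
]
for unit in group_coordination(SENTENCE):
    print(unit)
```

After grouping, the particle “ga” applies to the whole set of conjuncts, so both conjuncts can receive the same grammatical function when equations are generated.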
Among these problems, the following still remain in the method:
– Object ellipses
– Distinguishing two types of relative clauses
– Particles derived from verbs
6. Evaluation of the method
200 sentences were randomly selected from KTC.
F-structures of these sentences were automatically generated by the method.
These f-structures were manually corrected and used as the gold standard.
The automatically generated f-structures of these 200 sentences were compared with the gold standard.
Pred-only GFs   PRECISION(%)   RECALL(%)   F-SCORE(%)
adj             80.60          96.43       87.80
cj              100.00         96.80       98.37
comp            74.19          58.97       65.71
obj             98.73          82.54       89.91
obl             85.62          91.91       88.65
padj            98.45          91.81       95.01
rel             70.86          96.40       81.68
sadj            82.26          65.38       72.86
subj            93.17          92.29       92.73
topic           98.68          95.51       97.07
(average)       88.26          86.80       86.98
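The figures in the table can be cross-checked with the usual definition of the F-score as the harmonic mean of precision and recall. The sketch below uses only the precision and recall values from the table; it assumes the last row is the unweighted mean of the ten rows above it, which is consistent with the numbers.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per grammatical function, copied from the table.
ROWS = {
    "adj": (80.60, 96.43), "cj": (100.00, 96.80), "comp": (74.19, 58.97),
    "obj": (98.73, 82.54), "obl": (85.62, 91.91), "padj": (98.45, 91.81),
    "rel": (70.86, 96.40), "sadj": (82.26, 65.38), "subj": (93.17, 92.29),
    "topic": (98.68, 95.51),
}

for gf, (p, r) in ROWS.items():
    print(f"{gf:6s} F = {f_score(p, r):.2f}")

mean_precision = sum(p for p, _ in ROWS.values()) / len(ROWS)
mean_recall = sum(r for _, r in ROWS.values()) / len(ROWS)
print(f"mean precision = {mean_precision:.2f}, mean recall = {mean_recall:.2f}")
```

The recomputed F-scores agree with the table to within rounding of the published precision and recall values.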
7. Future work …
The method of generating f-structure equations directly from the dependency-based corpus of Japanese needs more improvement.
The result can be applied to improve the parsing result of KNP.
Using Japanese f-structures in MT
References
Abekawa, T. and M. Okumura. 2005. Corpus-Based Analysis of Japanese Relative Clause Constructions. IJCNLP 2005, pp. 46-57.
Cahill, A., M. McCarthy, J. van Genabith and A. Way. 2002. Automatic Annotation of the Penn-Treebank with LFG F-Structure Information. LREC 2002 Workshop on Linguistic Knowledge Acquisition and Representation, pp. 8-15.
Kurohashi, S. and D. Kawahara. 1992. JUMAN: user's manual. ms.
Kurohashi, S. and M. Nagao. 1994. A syntactic analysis method of long Japanese sentences based on the detection of conjunctive structures. Computational Linguistics, 20(4), pp. 507-534.
Kurohashi, S. and M. Nagao. 1998. Building a Japanese Parsed Corpus while Improving the Parsing System. Proceedings of the 1st International Conference on Language Resources and Evaluation, pp. 719-724.
Kurohashi, S., D. Kawahara and T. Shibata. 2005. Morphological and syntactic analyses using JUMAN/KNP. ms.
Masuoka, T. and Y. Takubo. 1992. Kiso nihongo bunpo. Kuroshio Publication.
Noguchi, M., H. Ichikawa, T. Hashimoto and T. Takenobu. 2006. A new approach to syntactic annotation. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC2006). pp.6
Noro, T., C. Koike, T. Hashimoto, T. Tokunaga and H. Tanaka. 2005. Evaluation of a Japanese CFG Derived from a Syntactically Annotated Corpus with respect to Dependency Measures. The 5th Workshop on Asian Language Resources. pp.9
Ohkuma, T., H. Masuichi and T. Yoshioka. 2006. Disambiguation of Japanese Focus Particles by using Lexical Functional Grammar. Journal of Natural Language Processing, 13(1):27-52.
Shibatani, M. 1990. The Languages of Japan. Cambridge University Press.
Teramura, H. 1991. Nihongo no shintakusu to imi. Kuroshio Publication.
Tsujimura, N. 2006. An Introduction to Japanese Linguistics (2nd ed.). Blackwell Publications.
Thank you very much!