Arabic WordNet: Semi-automatic Extensions using Bayesian Inference
H. Rodríguez (1), D. Farwell (1), J. Farreres (1), M. Bertran (1), M. Alkhalifa (2), M.A. Martí (2)
(1) TALP Research Center, UPC, Barcelona, Spain
(2) UB, Barcelona, Spain
LREC 2008 AWN
Index of the talk
• The AWN project
• Semi-automatic Extensions of AWN
Intuitive basis
Previous work using heuristics
Using Bayesian Networks
• Empirical evaluation
• Conclusions
The AWN project
• Funded by the USA REFLEX program (2005-2007)
Partners:
Universities: Princeton, Manchester, UPC, UB
Companies: Articulate Software, Irion
Description:
Black et al, 2006
Elkateb et al, 2006
Rodríguez et al, 2008
The AWN project
• Objectives
10,000 synsets, including some domain-specific data
linked to PWN 2.0, and eventually to PWN 3.0
linked to SUMO
+ 1,000 NE
manually built (or revised)
vowelized entries
including root of each entry
The AWN project
• Current figures
Arabic synsets: 11270
Arabic words: 23496
POS      DB content
adj      661
nouns    7961
adv      110
verbs    2538
Named entities:
Synsets that are named entities: 1142
Synsets that are not named entities: 10028
Words in synsets that are named entities: 1656
Semi-automatic Extensions of AWN
• Intuitive basis
In Arabic (and other Semitic languages), many words that share a common root (i.e. a sequence of typically three consonants) have related meanings and can be derived from a base verbal form by means of a small set of lexical rules
Semi-automatic Extensions of AWN
• Lexical rules
regular verbal derivative forms
regular nominal and adjectival derivative forms
masdar (verbal noun)
masculine and feminine active and passive participles
inflected verbal forms
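As a rough illustration of how lexical rules of this kind can be applied mechanically, the sketch below fills a triconsonantal root into a few templates. The template notation (digits 1 to 3 standing for the root consonants), the names PATTERNS, apply_pattern and derive, and the simplified Latin transliteration are all assumptions made for this example; they are not the rule set actually used for AWN.

```python
# Hypothetical templates: digits 1, 2, 3 stand for the three root consonants.
# Long vowels are written as doubled letters (simplified transliteration).
PATTERNS = {
    "form I perfect": "1a2a3a",
    "form II perfect": "1a22a3a",
    "form I active participle": "1aa2i3",
    "form I passive participle": "ma12uu3",
    "form II masdar": "ta12ii3",
}


def apply_pattern(root, template):
    """Replace each digit in the template with the matching root consonant."""
    if len(root) != 3:
        raise ValueError("this sketch only handles triconsonantal roots")
    return "".join(root[int(ch) - 1] if ch.isdigit() else ch for ch in template)


def derive(root):
    """Return every templated form for the given root."""
    return {name: apply_pattern(root, tpl) for name, tpl in PATTERNS.items()}


# Root d-r-s, the root of the verb added to the evaluation sample.
forms = derive(("d", "r", "s"))
```

With the root d-r-s this yields darasa, darrasa, daaris, madruus and tadriis. Passing the root as a tuple lets a consonant be written with more than one Latin character.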
Semi-automatic Extensions of AWN
• Procedure for generating a set of likely <Arabic word, English synset, score> tuples:
produce an initial list of candidate word forms
filter out the less likely candidates from this list
generate an initial list of attachments
score the reliability of these candidates
manually review the best scored candidates and include the valid associations in AWN
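The steps above can be sketched as a small pipeline. Everything in this block is hypothetical: the function propose, the frequency-based filter, the toy lexicon, the synset labels and the toy scoring function are placeholders chosen for illustration, and the slides do not say that the filtering step works this way.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    arabic_word: str
    synset: str
    score: float


def propose(forms, corpus_counts, lexicon, synsets_of, score_fn, min_count=1, top_k=10):
    """Turn candidate word forms into a ranked list for manual review."""
    # Filter: keep only forms attested often enough (assumed frequency filter).
    kept = [w for w in forms if corpus_counts.get(w, 0) >= min_count]
    # Attach: link each kept form to synsets through its English translations.
    pairs = {(w, s) for w in kept for e in lexicon.get(w, ()) for s in synsets_of.get(e, ())}
    # Score each attachment, then rank so a reviewer sees the best ones first.
    scored = [Candidate(w, s, score_fn(w, s)) for w, s in sorted(pairs)]
    scored.sort(key=lambda c: c.score, reverse=True)
    return scored[:top_k]


# Toy data (all invented for the example).
forms = ["darasa", "darrasa", "unattested"]
corpus_counts = {"darasa": 12, "darrasa": 7}
lexicon = {"darasa": ["study"], "darrasa": ["teach"]}
synsets_of = {"study": ["study-1", "study-2"], "teach": ["teach-1"]}


def toy_score(word, synset):
    """Spread one unit of score over all synsets reachable from the word."""
    return 1.0 / sum(len(synsets_of[e]) for e in lexicon[word])


review_list = propose(forms, corpus_counts, lexicon, synsets_of, toy_score)
```

The final manual review is deliberately left outside the function: the sketch only decides which tuples a human sees first.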
Semi-automatic Extensions of AWN
• Resources
PWN
AWN
LOGOS database of conjugated Arabic verbs
NMSU bilingual Arabic-English lexicon
Arabic Gigaword Corpus
UN (2000-2002) bilingual Arabic-English Corpus
Semi-automatic Extensions of AWN
• Score the reliability of the candidates
build a graph representing the words, synsets and their associations
associations synset-synset:
explicit in WN2.0
path-based
apply a set of heuristic rules that directly use the structure of the graph (GWC 2008)
apply Bayesian inference (LREC 2008)
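A minimal sketch of such a graph, assuming an undirected adjacency-set representation with nodes tagged by type ("A" for Arabic words, "E" for English words, "S" for synsets). The support function is an invented example of a score read directly off the graph structure; it is not one of the heuristics from the GWC 2008 work.

```python
from collections import defaultdict


def build_graph(ar_en, en_syn, syn_syn):
    """Build an undirected graph from three lists of (node, node) associations."""
    graph = defaultdict(set)
    for a, b in list(ar_en) + list(en_syn) + list(syn_syn):
        graph[a].add(b)
        graph[b].add(a)
    return graph


def support(graph, arabic_word, synset):
    """Count the English words adjacent to both the Arabic word and the synset."""
    return len(graph[arabic_word] & graph[synset])


# Toy associations (invented for the example).
ar_en = [(("A", "darrasa"), ("E", "teach")), (("A", "darrasa"), ("E", "instruct"))]
en_syn = [
    (("E", "teach"), ("S", "s1")),
    (("E", "instruct"), ("S", "s1")),
    (("E", "instruct"), ("S", "s2")),
]
syn_syn = [(("S", "s1"), ("S", "s2"))]

graph = build_graph(ar_en, en_syn, syn_syn)
support_s1 = support(graph, ("A", "darrasa"), ("S", "s1"))
support_s2 = support(graph, ("A", "darrasa"), ("S", "s2"))
```

Tagging each node with its type avoids collisions when an Arabic transliteration and an English word happen to share the same spelling.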
Using Bayesian Inference
[Figure: network with nodes A_base, A1 ... An (Arabic words), E1 ... Em (English words) and S1 ... Sp (synsets)]
Using Bayesian Inference
[Figure: Bayesian network arranged in layers 1 to 4, with nodes A1 ... An, E1 ... Em, S11 ... S1p and S21 ... S2r]
Using Bayesian Inference
• Building the CPT for each node in the BN
edges EW-AW
probabilities from statistical translation models built from the UN corpus using GIZA++ (word-word probabilities), filtered to remove pairs whose Arabic expressions have invalid Buckwalter encodings
all the probability mass is distributed among the pairs occurring in the BN
other edges (EW-S, S-S)
linear distribution on priors
noisy-OR model
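Two of the ingredients named above can be sketched in a few lines: redistributing the translation probability mass over the pairs that occur in the network, and filling a conditional probability table under a noisy-OR model. The function names, the direction of the translation probabilities (English word given Arabic word) and the toy numbers are assumptions for this example; the linear distribution on priors is not specified on the slide and is not modelled here.

```python
from itertools import product


def renormalize(trans_prob, arabic_word, english_in_bn):
    """Redistribute P(e | a) over the English words present in the network."""
    kept = {
        e: p
        for (a, e), p in trans_prob.items()
        if a == arabic_word and e in english_in_bn
    }
    total = sum(kept.values())
    return {e: p / total for e, p in kept.items()} if total else {}


def noisy_or_cpt(link_probs):
    """Return {parent_states: P(child = 1 | parent_states)} under noisy-OR."""
    cpt = {}
    for states in product((0, 1), repeat=len(link_probs)):
        p_off = 1.0  # probability that no active parent switches the child on
        for on, p in zip(states, link_probs):
            if on:
                p_off *= 1.0 - p
        cpt[states] = 1.0 - p_off
    return cpt


# Toy word-word probabilities (invented); "z" does not occur in the network.
trans_prob = {("a", "x"): 0.2, ("a", "y"): 0.2, ("a", "z"): 0.6}
normalized = renormalize(trans_prob, "a", {"x", "y"})
cpt = noisy_or_cpt([0.5, 0.5])
```

The appeal of noisy-OR here is that a node with k parents needs only k link probabilities rather than a table with 2^k free entries.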
Using Bayesian Inference
• Performing Bayesian Inference in the BN
Assign probability 1 to the nodes in layer 1
Infer the probabilities of the nodes in layer 3
For each word in layer 1, select as candidates the synsets in layer 3 that are connected to it and have a probability over a threshold
Score each candidate pair with this probability
Select the candidates scored over a threshold
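The procedure above can be approximated with a single forward pass, assuming noisy-OR nodes whose parents are treated as independent. This is a simplified stand-in for full Bayesian inference, not the inference engine used in the work; the function names, the node names and the link probabilities are invented, and one threshold is used where the slide mentions two selection steps.

```python
def propagate(parents, evidence):
    """Forward pass; parents maps node -> {parent: link_prob} in topological order."""
    prob = dict(evidence)
    for node, links in parents.items():
        p_off = 1.0
        for parent, link in links.items():
            # Each parent fails to activate the child with prob 1 - link * P(parent).
            p_off *= 1.0 - link * prob[parent]
        prob[node] = 1.0 - p_off
    return prob


def select(prob, connected, threshold):
    """Return (word, synset, score) tuples whose score exceeds the threshold."""
    picked = [
        (a, s, prob[s])
        for a, syns in connected.items()
        for s in syns
        if prob[s] > threshold
    ]
    return sorted(picked, key=lambda t: t[2], reverse=True)


# Layer 1 (Arabic word) is clamped to probability 1.
evidence = {"A1": 1.0}
# Layer 2 (English words) and layer 3 (synsets), listed parents-first.
parents = {
    "E1": {"A1": 0.5},
    "E2": {"A1": 0.5},
    "S1": {"E1": 1.0, "E2": 1.0},
    "S2": {"E2": 0.5},
}
prob = propagate(parents, evidence)
candidates = select(prob, {"A1": ["S1", "S2"]}, threshold=0.3)
```

In this toy network S1 is reachable through two English words and ends up at 0.75, while S2 is reachable through one weaker link, ends up at 0.25 and falls below the threshold.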
Empirical Evaluation
• 10 verbs randomly selected from AWN + درس
Arabic verb    # English Words    # Synsets (S1 S2)
عَامَلَ          107                190
أَعْقَبَ          71                 77
صَقَلَ           31                 21
رَتَّبَ           62                 102
أَخَّرَ           19                 9
أَخْبَرَ          80                 105
َّ َر َش ح       40                 22
غَامَرَ          56                 49
أَشْبَعَ          38                 34
أَخْرَجَ          85                 140
دَرَّسَ           57                 51
Empirical Evaluation
• Results
Selection   Threshold        Candidates  Accept  Reject  Precision  Recall  F1
HEU         all heuristics   272         135     137     0.50       0.61    0.55
HEU         heuristics 1,2   61          40      21      0.65       0.18    0.28
BN          0                554         223     331     0.40       1       0.57
BN          0.01             243         125     118     0.51       0.56    0.53
BN          0.02             214         116     98      0.54       0.52    0.53
BN          0.07             112         65      47      0.58       0.29    0.39
BN          0.1              100         60      40      0.60       0.27    0.37
BN + HEU    0                272         154     118     0.56       0.69    0.62
BN + HEU    0.01             212         121     91      0.57       0.54    0.55
BN + HEU    0.02             201         115     86      0.57       0.41    0.48
BN + HEU    0.07             92          65      27      0.71       0.38    0.5
BN + HEU    0.1              83          59      24      0.71       0.12    0.21
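For reference, the precision and F1 figures can be recomputed from the counts. The sketch assumes precision is accepted candidates divided by all candidates and that F1 is the usual harmonic mean; the slide does not state the denominator used for recall, so recall is taken here as reported rather than recomputed.

```python
def precision(accepted, candidates):
    """Fraction of proposed candidates that the reviewer accepted."""
    return accepted / candidates


def f1(p, r):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * p * r / (p + r) if p + r else 0.0


# BN with threshold 0.1: 100 candidates, 60 accepted, reported recall 0.27.
bn_precision = round(precision(60, 100), 2)
bn_f1 = round(f1(0.60, 0.27), 2)

# HEU with all heuristics: 272 candidates, 135 accepted, reported recall 0.61.
heu_precision = round(precision(135, 272), 2)
heu_f1 = round(f1(0.50, 0.61), 2)
```

Both rows reproduce the reported values: 0.60 and 0.37 for BN at threshold 0.1, and 0.50 and 0.55 for HEU with all heuristics.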
Conclusions
• The BN approach roughly doubles the number of candidates of the previous HEU approach (554 vs 272).
• The sample is clearly insufficient.
• The overlap of HEU + BN seems to improve the results (best F1 of 0.62, against 0.57 for BN alone and 0.55 for HEU alone).
• An analysis of the errors shows that a substantial number were due to the lack of the shadda diacritic or of the feminine ending form (ta marbuta, ة).
Further work
• Repeat the entire procedure, relying where possible on dictionaries containing diacritics
• Refine the scoring procedure by assigning different weights to the different relations
• Include additional relations (e.g. path-based)
• Use additional knowledge sources for weighting the relations:
related entries already included in AWN
SUMO
Magnini's domain codes
Thank you for your attention