Handling of Missing Values in Lexical Acquisition


LREC 2010, La Valletta, Malta, May 2010
Núria Bel
Universitat Pompeu Fabra
GRUP DE TECNOLOGIES DELS RECURSOS LINGÜÍSTICS (TRL)
With Automatic Lexical Information Acquisition we try to find out how to
build repositories of language-dependent lexical information automatically.
Many of the technologies behind applications (MT, IE, Automatic
Summarization, Sentiment Analysis, Opinion Mining, Question Answering, etc.)
need this information to work.
[Figure: two sample lexical entries, one for the adjective stem "paralelo" (AST, ALO "paralel") and one for the noun stem "fiesta" (NST, ALO "fiest"). Each entry lists attribute-value pairs (ATR, CL, GD, FC, KN, LY, PLC, MC, TYN, PRED, TA) followed by AUTHOR, DATE and SITE metadata.]
Entries borrowed from the MT system Incyta (Metal family)
Cue Based Lexical Acquisition
• Differences in the distribution of certain contexts
separate words of different classes (Harris,
1951).
• For example: some / *many mud
• Words (types) can be represented in terms of a
collection of contexts where their occurrence or
not in these contexts is taken as hints or cues for
a word to be classified as being of a particular
class.
Word occurrences are represented as
vectors and used to train a classifier.
@data
15,2,8,4,0,8,1,0,1,0,0,0,0,0
Each value is the number of times the word has been observed in one of
the defined contexts.
Non-occurrence in a particular context is as informative
as occurrence.
We use supervised classifiers (Support Vector Machines,
Decision Trees) to predict the class (Abstract, Mass, etc.)
of new words.
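A minimal sketch of how such a count vector might be built, assuming tokenized sentences. The three cues in `CUES`, their names and the toy corpus are hypothetical illustrations, not the cues used in the experiments.

```python
from collections import Counter

# Hypothetical cues: each maps a name to a test on the token preceding the target word.
CUES = {
    "after_some": lambda prev: prev == "some",
    "after_much": lambda prev: prev == "much",
    "after_many": lambda prev: prev == "many",
}

def cue_vector(word, sentences):
    """Count how often `word` occurs in each cue context; unobserved cues stay 0."""
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != word or i == 0:
                continue
            for name, matches in CUES.items():
                if matches(tokens[i - 1]):
                    counts[name] += 1
    return [counts[name] for name in CUES]

# Toy corpus, invented for illustration only.
corpus = [
    ["there", "was", "some", "mud", "on", "the", "floor"],
    ["too", "much", "mud"],
    ["some", "mud", "again"],
]
print(cue_vector("mud", corpus))  # [2, 1, 0]
```

The final 0 is exactly the kind of value discussed in the rest of the talk: it may mean that "many mud" is impossible, or only that it was not seen in this corpus.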
Cues, classification and state-of-the-art results
• Merlo and Stevenson (2001) selected very specific cues
for classifying verbs into a number of Levin (1993) based
verbal classes: animacy of the subject, passives, ...
• Baldwin (2005) used general features, such as the POS
tags of neighboring words, for type classification.
• Joanis et al. (2007) used the frequency of filled syntactic
positions or slots, tense and voice of occurring verbs,
etc., to describe the whole system of English verbal
classes.
• The results are difficult to compare, but accuracy is
about 70%.
The problem: missing values
The Sparse data problem
• Joanis and Stevenson (2003), Joanis et al. (2007) and Korhonen et al. (2008) mention that they have to face the problem of sparse data: many of the types/words are low in frequency and provide very little information.
• Most words occur very rarely (i.e. they follow a Zipfian distribution) and will therefore show few cues.
• Yallop et al. (2005) calculated that in the 100M-word British National Corpus, from a total of 124,120 distinct adjectives, 70,246 occur only once.
• The cues we can use as information are mutually exclusive within a single occurrence: an adjective can be prenominal or postnominal, but if it occurs only once it will show only one cue, the other ones being a zero value.
• Even for types that occur more than once, the optional nature and variety of the contexts of occurrence give rise to missing values.
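To make the scale of the problem concrete, the share of single-occurrence types can be computed directly. The sketch below uses the Yallop et al. (2005) figures quoted above; the helper `hapax_ratio` and the toy token list are illustrative assumptions.

```python
from collections import Counter

def hapax_ratio(tokens):
    """Share of distinct types that occur exactly once in `tokens`."""
    freq = Counter(tokens)
    if not freq:
        return 0.0
    hapaxes = sum(1 for n in freq.values() if n == 1)
    return hapaxes / len(freq)

# Figures quoted from Yallop et al. (2005) for adjectives in the BNC.
bnc_adjective_types = 124_120
bnc_adjective_hapaxes = 70_246
print(round(bnc_adjective_hapaxes / bnc_adjective_types, 3))  # 0.566

# Toy token list, invented for illustration only.
toy = ["red", "red", "big", "parallel", "steely"]
print(hapax_ratio(toy))  # 0.75
```

In other words, more than half of the adjective types in the BNC can show at most one cue.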
Zero values and learning
• Zero values not only leave too little information to decide, they also add
uncertainty when learning from the data.
• A zero value may indeed be a negative value, i.e. the cue is precisely that
the context has not been observed, but it may also be that the cue was simply
not observed in the examined corpus, for various reasons.
• When there are many zero values, the cue loses its predictive
power because of this uncertainty.
• Katz (1987) and Baayen and Sproat (1996), among others,
acknowledged the importance of preprocessing low frequency
events and Joanis et al. (2007) also decided to smooth the data,
even working with more than 1000 occurrences per verb in the BNC.
Our smoothing experiment: Harmonization based
on linguistic information
Intuitively:
How likely is it that a 0 is just an
unobserved feature and not a true 0, given the
values of the other observations?
To classify Abstract/Concrete nouns in English:
Cue 1 is the suffixes “-ness”, “-ism”, … for Abstract (Light 1996)
Cue 2 is the determiners “such”, “little”, “much”, … for Abstract
Cue 3 is adjectives like “big”, “small”, … for Concrete
P(cue_1=1|[0,1,0]) =
    P(abstract=yes|[0,1,0]) * P(cue_1=1|abstract=yes)
  + P(abstract=no|[0,1,0]) * P(cue_1=1|abstract=no)
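The formula is a weighted sum over the two classes. A minimal sketch of that arithmetic follows; the function name `cue_likelihood` and all the numbers are hypothetical, chosen only to show how the terms combine.

```python
def cue_likelihood(class_posterior, cue_given_class):
    """P(cue=1 | vector) = sum over classes of P(class | vector) * P(cue=1 | class)."""
    return sum(class_posterior[k] * cue_given_class[k] for k in class_posterior)

# Hypothetical numbers, chosen only to illustrate the arithmetic.
posterior = {"abstract=yes": 0.8, "abstract=no": 0.2}  # P(class | [0,1,0])
cue_1 = {"abstract=yes": 0.5, "abstract=no": 0.0}      # P(cue_1=1 | class)

print(cue_likelihood(posterior, cue_1))  # 0.8 * 0.5 + 0.2 * 0.0
```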
• We use the information of observed features to assess
the likelihood of a particular unobserved cue.
• Harmonization is substituting 0 values by the likelihood
of being 1 given the other cues observed.
• BUT …
In order to get P(cue_1=1|[0,1,0]) we need to have
P(cue_n|class) for all cues in the vector.
$$P(v \mid k) = \prod_i P(j_i \mid k)$$
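One way to obtain the class probability given the vector from this likelihood is Bayes' rule. The prior P(k) is an assumption of this sketch: the slides do not say which prior, if any, was used.

```latex
P(k \mid v) \;=\; \frac{P(k)\,\prod_i P(j_i \mid k)}
                       {\sum_{k'} P(k')\,\prod_i P(j_i \mid k')}
```

With a uniform prior, P(k) cancels and the class probability is simply the normalized likelihood.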
The challenge: how to get P(cue_n|class) with so
many 0’s in the data… ?
By estimating P(cue_n|class) with linguistic
information:

            Suffix=no   Suffix=yes   SC_Adj=no   SC_Adj=yes
Abstract    0.5         0.5          1.0         0.0
Concrete    1.0         0.0          0.5         0.5
“The probability of being Concrete and having suffix “ness”
is 0”
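Putting the pieces together, harmonization with the table values might look as follows. This is a sketch under stated assumptions, not the author's implementation: the class probability is computed from the non-zero cues only, the prior is uniform, any non-zero count is mapped to 1, and the names `P_CUE`, `class_posterior` and `harmonize` are invented here.

```python
# P(cue=yes | class), read off the table above.
P_CUE = {
    "Abstract": {"Suffix": 0.5, "SC_Adj": 0.0},
    "Concrete": {"Suffix": 0.0, "SC_Adj": 0.5},
}
CUE_NAMES = ["Suffix", "SC_Adj"]

def class_posterior(counts, prior=None):
    """P(class | observed cues), using only cues with a non-zero count (assumption)."""
    classes = list(P_CUE)
    prior = prior or {k: 1.0 / len(classes) for k in classes}  # uniform prior (assumption)
    scores = {}
    for k in classes:
        score = prior[k]
        for name, count in zip(CUE_NAMES, counts):
            if count > 0:
                score *= P_CUE[k][name]
        scores[k] = score
    total = sum(scores.values())
    if total == 0:  # observed cues contradict every class: fall back to the prior
        return dict(prior)
    return {k: s / total for k, s in scores.items()}

def harmonize(counts):
    """Observed cues become 1 (assumption); each 0 becomes P(cue=1 | observed cues)."""
    posterior = class_posterior(counts)
    return [
        1.0 if count > 0
        else sum(posterior[k] * P_CUE[k][name] for k in P_CUE)
        for name, count in zip(CUE_NAMES, counts)
    ]

print(harmonize([3, 0]))  # suffix seen three times, SC_Adj never: [1.0, 0.0]
print(harmonize([0, 0]))  # nothing observed: [0.25, 0.25]
```

In the first call the suffix rules out Concrete, so the missing SC_Adj value stays at 0. In the second call nothing is known, and both zeros are replaced by the class-averaged likelihood.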
Harmonization effects in Spanish Mass experiment
agua (‘water’)
    Frequency:  0,3,0,1,0,1,1,0,1,0,0,1,1,0
    Harmonized: 0,1,0,1,0,1,1,0,1,0,0,1,1,0

acero (‘steel’)
    Frequency:  1,2,0,0,0,2,1,1,2,0,0,0,0,0
    Harmonized: 1,1,0.5,0.5,0.5,1,1,1,1,0,0,0,0,0

desabastecimiento (‘shortage’)
    Frequency:  0,0,0,0,0,0,1,0,0,0,0,0,0,0
    Harmonized: 0.5,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0,0,0,0,0

aceptabilidad (‘acceptability’)
    Frequency:  0,0,0,0,0,0,0,0,0,0,0,0,0,0
    Harmonized: 0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.47,0.47,0.47,0.47,0.47
Results of the experiments
Experiment         Classifier   Mean   Trimmed mean   Frequency   Harmonized   Baseline
Spanish Mass       DT           74.2   77.5           79.9        82.8         74.8
                   SVM          63.8   67.4           79.1        80.7
English Abstract   DT           57.8   55.6           61.4        76.1         61.5
                   SVM          61.0   61.0           64.1        70.1
Error Analysis & Future work
• Frequency information, which helps to filter noise, has been
neutralized by harmonization.
• Future work is about how to handle missing values
and noise together.
Thanks for your attention!