A Bootstrapping Method
for Building Subjectivity Lexicons
for Languages with Scarce Resources
Carmen Banea, Rada Mihalcea
University of North Texas
Janyce Wiebe
University of Pittsburgh
Subjectivity analysis
Subjectivity analysis (opinions and sentiments)
Used in a wide variety of applications
Tracking sentiment timelines in news (Lloyd et al., 2005)
Review classification (Turney, 2002; Pang et al., 2002)
Mining opinions from product reviews (Hu and Liu, 2004)
Expressive text-to-speech synthesis (Alm et al., 2005)
Text semantic analysis (Wiebe and Mihalcea, 2006; Esuli and Sebastiani,
2006)
Question answering (Yu and Hatzivassiloglou, 2003)
Much work on subjectivity analysis has focused on English
Some work on other languages: Japanese (Takamura et al., 2006), Chinese (Hu et al., 2005), German (Kim
and Hovy, 2006)
Proportion of Languages on the Web
internetworldstats.com ~ updated November 30, 2007
Objective
Develop a method for subjectivity analysis that
Requires few electronic resources
Can be easily ported to a new language
Applicable to the large number of languages that have
scarce electronic resources
Related Work
Tools that rely on manually or semi-automatically constructed
lexicons
Yu and Hatzivassiloglou, 2003; Riloff and Wiebe, 2003; Kim and Hovy, 2006
These lexicons enable efficient rule-based subjectivity and sentiment classifiers that rely
on the presence of lexicon entries in text
These tools assume the availability of
advanced language processing tools:
Syntactic parsers (Wiebe, 2000), Information extraction (Riloff and Wiebe, 2003)
broad-coverage rich lexical resources
WordNet (Esuli and Sebastiani, 2006)
Our approach relates most closely to the method of (Turney,
2002) for the construction of lexicons annotated for polarity
We address the task of acquiring a subjectivity lexicon
We rely on fewer, smaller-scale resources
Our Method
Based on bootstrapping
Requires:
A small seed set of subjective entries
One/multiple electronic dictionaries
A small training corpus (approx. 500,000 words)
Experiments focused on Romanian
Applicable to other languages as well
Bootstrapping Process
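Flowchart: the seeds are used to query the online dictionary for candidate synonyms; fixed filtering yields the selected synonyms, which are expanded again until the maximum number of iterations is reached; variable filtering is then applied to the candidates.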
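The loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `lookup` (a dictionary query returning related words) and `similarity` (a score against the original seeds) are hypothetical callables supplied by the caller, and giving the seeds a score of 1.0 is an assumption made for convenience.

```python
from typing import Callable, Dict, Iterable, List, Set


def bootstrap(
    seeds: Iterable[str],
    lookup: Callable[[str], List[str]],   # hypothetical dictionary query
    similarity: Callable[[str], float],   # hypothetical similarity to the seeds
    fixed_threshold: float,
    max_iterations: int,
) -> Dict[str, float]:
    """Expand the seeds through a dictionary, filtering after each iteration."""
    # Assumption: seeds are kept in the lexicon with the top score.
    lexicon: Dict[str, float] = {s: 1.0 for s in seeds}
    frontier: Set[str] = set(lexicon)
    for _ in range(max_iterations):
        selected: Set[str] = set()
        for word in frontier:
            for cand in lookup(word):            # query the dictionary
                if cand in lexicon:
                    continue                     # already accepted earlier
                score = similarity(cand)
                if score > fixed_threshold:      # fixed filtering
                    lexicon[cand] = score
                    selected.add(cand)
        if not selected:                         # nothing new to expand
            break
        frontier = selected                      # only selected words are expanded again
    return lexicon
```

With a toy dictionary, a candidate rejected by the fixed filter is never expanded, so words reachable only through it stay out of the lexicon.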
Seed Set
60 seeds, evenly sampled from verbs, nouns, adjectives and adverbs
Manually selected
Seed sources:
XI-th grade curriculum for Romanian Language and Literature
Translations of instances appearing in the OpinionFinder strong subjective lexicon (Wiebe and Riloff, 2005)
Sample entries by category (with their English translation):
Noun: blestem (curse), despot (tyrant), furie (fury), idiot (idiot), fericire (happiness)
Verb: iubi (love), aprecia (appreciate), spera (hope), dori (wish), uri (hate)
Adjective: frumos (beautiful), dulce (sweet), urat (ugly), fericit (happy), fascinant (fascinating)
Adverb: posibil (possibly), probabil (probably), desigur (of course), enervant (unnerving)
Expansion
For each seed, candidates are drawn from its synonyms and its definition
Candidates are all open-class words that have a definition in the dictionary and are longer than 3 letters
Diacritics are removed
Romanian dictionary: http://www.dexonline.ro
Dictionaries for other languages are also available, or can be
obtained from paper dictionaries through OCR
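The candidate constraints on this slide (open-class, present in the dictionary, longer than 3 letters, diacritics removed) could be applied along these lines. This is only a sketch: the part-of-speech labels, the `headwords` set and the function names are assumptions for illustration, and the slides do not say how part of speech was determined.

```python
import unicodedata
from typing import Iterable, List, Set, Tuple

# Assumed labels for the four open classes.
OPEN_CLASS = {"noun", "verb", "adjective", "adverb"}


def strip_diacritics(word: str) -> str:
    """Remove combining marks, e.g. 'plăcut' -> 'placut'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def candidates_from_entry(
    tagged_words: Iterable[Tuple[str, str]],  # (word, part of speech) pairs
    headwords: Set[str],                      # diacritic-free words that have a dictionary entry
    min_len: int = 3,
) -> List[str]:
    """Keep open-class words with a dictionary entry that are longer than min_len letters."""
    kept: List[str] = []
    seen: Set[str] = set()
    for word, pos in tagged_words:
        norm = strip_diacritics(word.lower())
        if (
            pos in OPEN_CLASS
            and len(norm) > min_len
            and norm in headwords
            and norm not in seen
        ):
            seen.add(norm)
            kept.append(norm)
    return kept
```

Normalising before the dictionary check means that spelling variants with and without diacritics collapse to a single candidate.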
Filtering
Candidates are filtered based on a measure of similarity
with the original seeds
We use Latent Semantic Analysis (LSA) (Dumais et al.,
1988) trained on the SemCor corpus (Miller et al.,
1993)
After each iteration, only candidates with an LSA score
higher than a given threshold are selected for further
expansion
Example:
Seed: dulce (sweet)
Candidate synonyms: cu gust dulce (sweet-tasting), placut
(pleasant), dulceag (quasi-sweet)
Filtering
Several iterations of the bootstrapping process will
result in a subjectivity lexicon consisting of a ranked
list of candidates in decreasing order of similarity to
the original seeds
A variable filtering threshold can be used to further
restrict the similarity and obtain a purer lexicon
Filtering parameters:
Similarity threshold
Number of iterations
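Threshold-based filtering and ranking can be illustrated with plain cosine similarity. The sketch below assumes each word already has an LSA vector and scores a candidate against the centroid of the seed vectors; the slides do not specify how similarity to the seed set is aggregated, so the centroid is an assumption, as are all the function names.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of the given vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rank_and_filter(
    candidates: Dict[str, Vector],
    seed_vectors: List[Vector],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Keep candidates scoring above threshold, most similar first."""
    center = centroid(seed_vectors)  # assumption: seeds summarised by their centroid
    scored = [(word, cosine(vec, center)) for word, vec in candidates.items()]
    kept = [(word, score) for word, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_and_filter` again on the final list with a higher threshold plays the role of the variable filter: the ranking is unchanged, only the cut-off moves.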
Lexicon Acquisition
Plot: No. of Lexicon Entries (y-axis, 0 to 7000) against No. of Iterations (x-axis, 1 to 5), one curve per LSA similarity threshold (0.4, 0.45, 0.5, 0.55)
Evaluation
Rule-based classifier of subjectivity
(Riloff and Wiebe, 2003)
Subjective sentence: three or more subjective entries.
Objective sentence: two or fewer subjective entries.
Gold standard data set
(Mihalcea, Banea and Wiebe, 2007)
504 sentences from five SemCor documents (manually
translated into Romanian)
Labeled by two annotators
Agreement (all): 83% (κ=0.67)
Agreement (uncertain removed): 89% (κ=0.77)
Baseline: 54% (all subjective)
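The classification rule stated on this slide (three or more subjective entries means subjective, otherwise objective) is simple enough to sketch directly. The function name is hypothetical, and counting every matching token occurrence rather than distinct entries is an assumption; the slides do not say which is used.

```python
from typing import Iterable, Set


def classify(tokens: Iterable[str], lexicon: Set[str], min_hits: int = 3) -> str:
    """Label a tokenized sentence by counting tokens found in the lexicon."""
    # Assumption: repeated occurrences of the same entry each count as a hit.
    hits = sum(1 for token in tokens if token.lower() in lexicon)
    return "subjective" if hits >= min_hits else "objective"
```

For example, with a lexicon containing `frumos`, `fericit` and `iubi`, a sentence with all three is labeled subjective, while one with only two is labeled objective.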
Number of Iterations
F-measure for the bootstrapping subjectivity lexicon over 5 iterations
and an LSA threshold of 0.5
Similarity Threshold
F-measure for the fifth bootstrapping iteration for varying LSA scores
Comparison
Bootstrapping rule-based classifier: uses a 3,913-entry subjectivity
lexicon obtained through 5 iterations and a similarity threshold of 0.5
Conclusions
Our bootstrapping method uses few electronic resources:
A small seed set
One/multiple dictionaries
A small corpus of half a million words
A large subjectivity lexicon of approx. 4000 entries was
extracted
Using an unsupervised rule-based classifier, a subjectivity F-measure of 66.20% and an overall F-measure of 61.69% can
be achieved
Questions?