A Bootstrapping Method
for Building Subjectivity Lexicons
for Languages with Scarce Resources
Carmen Banea, Rada Mihalcea
University of North Texas
Janyce Wiebe
University of Pittsburgh
Subjectivity analysis
 Subjectivity analysis (opinions and sentiments)
 Used in a wide variety of applications
 Tracking sentiment timelines in news (Lloyd et al., 2005)
 Review classification (Turney, 2002; Pang et al., 2002)
 Mining opinions from product reviews (Hu and Liu, 2004)
 Expressive text-to-speech synthesis (Alm et al., 2005)
 Text semantic analysis (Wiebe and Mihalcea, 2006; Esuli and Sebastiani,
2006)
 Question answering (Yu and Hatzivassiloglou, 2003)
 Much work on subjectivity analysis has focused on English
 Work on other languages: Japanese (Takamura et al., 2006), Chinese (Hu et al., 2005), German (Kim
and Hovy, 2006)
Proportion of Languages on the Web
Source: internetworldstats.com (updated November 30, 2007)
Objective
 Develop a method for subjectivity analysis that
 Requires few electronic resources
 Can be easily ported to a new language
 Applicable to the large number of languages that have
scarce electronic resources
Related Work
 Tools that rely on manually or semi-automatically constructed
lexicons
 Yu and Hatzivassiloglou, 2003; Riloff and Wiebe, 2003; Kim and Hovy, 2006
 Enable efficient rule-based subjectivity and sentiment classifiers that rely
on the presence of lexicon entries in text
 These tools assume the availability of
 advanced language processing tools:
 Syntactic parsers (Wiebe, 2000), Information extraction (Riloff and Wiebe, 2003)
 broad-coverage rich lexical resources
 WordNet (Esuli and Sebastiani, 2006)
 Our approach relates most closely to the method of Turney (2002)
for the construction of lexicons annotated for polarity
 We address the task of acquiring a subjectivity lexicon
 We rely on fewer, smaller-scale resources
Our Method
 Based on bootstrapping
 Requires:
 A small seed set of subjective entries
 One or more electronic dictionaries
 A small training corpus (approx. 500,000 words)
 Experiments focused on Romanian
 Applicable to other languages as well
Bootstrapping Process
Flowchart: seeds -> query -> online dictionary -> candidate synonyms -> fixed filtering -> selected synonyms -> max. no. of iterations?
 If no: query again with the selected synonyms
 If yes: candidate synonyms -> variable filtering
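The loop in the flowchart can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `bootstrap`, `lookup`, and `similarity` are hypothetical names, seeds are given a score of 1.0 by convention, and the toy dictionary and scores (including the words "zahar" and "agreabil") are invented for the example.

```python
from typing import Callable, Dict, Iterable, List


def bootstrap(
    seeds: Iterable[str],
    lookup: Callable[[str], List[str]],
    similarity: Callable[[str], float],
    fixed_threshold: float,
    max_iterations: int,
) -> Dict[str, float]:
    """Grow a lexicon from seeds; returns a word -> similarity score map."""
    # Assumption: seeds are always kept and receive the maximum score.
    lexicon: Dict[str, float] = {seed: 1.0 for seed in seeds}
    frontier = set(lexicon)
    for _ in range(max_iterations):
        selected = set()
        for word in frontier:
            # Query the dictionary for candidates related to this word.
            for candidate in lookup(word):
                if candidate in lexicon:
                    continue
                score = similarity(candidate)
                # Fixed filtering: only candidates above the threshold
                # are selected for further expansion.
                if score > fixed_threshold:
                    lexicon[candidate] = score
                    selected.add(candidate)
        if not selected:
            break  # nothing new to expand
        frontier = selected
    return lexicon


# Toy data, invented for illustration only.
TOY_DICTIONARY = {
    "dulce": ["placut", "dulceag", "zahar"],
    "placut": ["agreabil"],
}
TOY_SCORES = {"placut": 0.8, "dulceag": 0.7, "zahar": 0.2, "agreabil": 0.6}

lexicon = bootstrap(
    seeds=["dulce"],
    lookup=lambda w: TOY_DICTIONARY.get(w, []),
    similarity=lambda w: TOY_SCORES.get(w, 0.0),
    fixed_threshold=0.5,
    max_iterations=5,
)
```

In this sketch the loop also stops early when an iteration selects no new words, which the flowchart does not show explicitly.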
Seed Set
 60 seeds, evenhandedly sampled from verbs, nouns, adjectives and adverbs
 Manually selected
 Seed sources:
 XI-th grade curriculum for Romanian Language and Literature
 Translations of instances appearing in the OpinionFinder strong subjective lexicon (Wiebe and Riloff, 2005)
Category | Sample Entries (with their English translation)
Noun | blestem (curse), despot (tyrant), furie (fury), idiot (idiot), fericire (happiness)
Verb | iubi (love), aprecia (appreciate), spera (hope), dori (wish), uri (hate)
Adjective | frumos (beautiful), dulce (sweet), urat (ugly), fericit (happy), fascinant (fascinating)
Adverb | posibil (possibly), probabil (probably), desigur (of course), enervant (unnerving)
Expansion
 For each seed, candidate synonyms are collected from its dictionary entry (synonyms and definition)
 Candidates: all open-class words that have a definition in the dictionary and are longer than 3 letters
 Diacritics are removed
 Romanian dictionary: http://www.dexonline.ro
 Dictionaries for other languages are also available, or can be
obtained from paper dictionaries through OCR
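The candidate constraints on this slide could look roughly as follows. This is a sketch under assumptions: the part-of-speech labels, the function names, and the idea that a set of dictionary headwords is available are all hypothetical, and "longer than 3 letters" is read as applying to the candidate word itself.

```python
import unicodedata

# Assumption: part-of-speech labels use these plain English names.
OPEN_CLASSES = {"noun", "verb", "adjective", "adverb"}


def strip_diacritics(word: str) -> str:
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def keep_candidate(word: str, pos: str, headwords: set) -> bool:
    """Keep open-class words longer than 3 letters that the dictionary defines."""
    return pos in OPEN_CLASSES and len(word) > 3 and word in headwords


normalized = strip_diacritics("plăcut")  # Romanian for "pleasant"
kept = keep_candidate(normalized, "adjective", {"placut", "dulceag"})
```

Normalizing before the dictionary check keeps lookups consistent when some sources spell a word with diacritics and others without.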
Filtering
 Candidates are filtered based on a measure of similarity
with the original seeds
 We use Latent Semantic Analysis (LSA) (Dumais et al.,
1988) trained on the SemCor corpus (Miller et al.,
1993)
 After each iteration, only candidates with an LSA score
higher than a given threshold are selected for further
expansion
 Example:
 Seed: dulce (sweet)
 Candidate synonyms: cu gust dulce (sweet-tasting), placut
(pleasant), dulceag (quasi-sweet)
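One way to score a candidate against the seeds is cosine similarity in the LSA vector space. The sketch below makes several assumptions that the slides do not confirm: the seed set is represented by the sum of the seed vectors, the function names are hypothetical, and the two-dimensional vectors are invented stand-ins for vectors that a trained LSA model would supply.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def seed_similarity(candidate_vec: Sequence[float],
                    seed_vecs: List[Sequence[float]]) -> float:
    """Similarity between a candidate and the seed set (assumed: summed seed vectors)."""
    seed_sum = [sum(dim) for dim in zip(*seed_vecs)]
    return cosine(candidate_vec, seed_sum)


# Invented toy vectors; a real run would take them from the LSA model.
toy_seed_vectors = [[1.0, 0.0], [0.0, 1.0]]
close_score = seed_similarity([2.0, 2.0], toy_seed_vectors)
far_score = seed_similarity([1.0, -1.0], toy_seed_vectors)
```

A candidate is then kept for further expansion only if its score exceeds the chosen threshold.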
Filtering
 Several iterations of the bootstrapping process will
result in a subjectivity lexicon consisting of a ranked
list of candidates in decreasing order of similarity to
the original seeds
 A variable filtering threshold can be used to further
restrict the similarity and obtain a purer lexicon
 Filtering parameters:
 Similarity threshold
 Number of iterations
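Variable filtering over the ranked list can be sketched as below. The function name is hypothetical and the scores are invented; only the words come from the earlier example.

```python
from typing import Dict, List, Tuple


def variable_filter(scored: Dict[str, float],
                    threshold: float) -> List[Tuple[str, float]]:
    """Keep entries scoring above the threshold, ranked by decreasing similarity."""
    kept = [(word, score) for word, score in scored.items() if score > threshold]
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Invented scores for illustration only.
toy_scores = {"placut": 0.8, "dulceag": 0.55, "cu gust dulce": 0.45}
ranked = variable_filter(toy_scores, threshold=0.5)
```

Raising the threshold shortens the list and keeps only the entries closest to the seeds.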
Lexicon Acquisition
Plot: No. of Lexicon Entries (y-axis, 0 to 7000) vs. No. of Iterations (x-axis, 1 to 5); legend: 0.4, 0.45, 0.5, 0.55
Evaluation
 Rule-based classifier of subjectivity
 (Riloff and Wiebe, 2003)
 Subjective sentence: three or more subjective entries.
 Objective sentence: two or fewer subjective entries.
 Gold standard data set
 (Mihalcea, Banea and Wiebe, 2007)
 504 sentences from five SemCor documents (manually
translated into Romanian)
 Labeled by two annotators
 Agreement (all): 83% (κ = 0.67)
 Agreement (uncertain removed): 89% (κ = 0.77)
 Baseline: 54% (all subjective)
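The classification rule above (three or more subjective entries means subjective) can be sketched as follows. The whitespace tokenization, punctuation stripping, and counting of every matching token are assumptions of this sketch, and the Romanian sentences are invented for illustration.

```python
def classify(sentence: str, lexicon: set, min_hits: int = 3) -> str:
    """Label a sentence by counting tokens that appear in the subjectivity lexicon."""
    # Assumption: lowercase, split on whitespace, strip surrounding punctuation.
    tokens = [tok.strip(".,;:!?\"'()") for tok in sentence.lower().split()]
    hits = sum(1 for tok in tokens if tok in lexicon)
    return "subjective" if hits >= min_hits else "objective"


toy_lexicon = {"frumos", "dulce", "fascinant", "fericit"}
label_three = classify("Un film frumos, dulce si fascinant.", toy_lexicon)
label_two = classify("Un film frumos si fascinant.", toy_lexicon)
```

The `min_hits` parameter is exposed only to make the rule explicit; the slides fix it at three.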
Number of Iterations
F-measure for the bootstrapping subjectivity lexicon over 5 iterations
and an LSA threshold of 0.5
Similarity Threshold
F-measure for the fifth bootstrapping iteration for varying LSA scores
Comparison
 Bootstrapping rule-based classifier: uses a 3,913-entry subjectivity
lexicon obtained through 5 iterations and a similarity threshold of 0.5
Conclusions
 Our bootstrapping method uses few electronic resources:
 A small seed set
 One or more dictionaries
 A small corpus of half a million words
 A large subjectivity lexicon of approx. 4000 entries was
extracted
 Using an unsupervised rule-based classifier, a subjectivity
F-measure of 66.20% and an overall F-measure of 61.69% can
be achieved
Questions?