Text Representation & Text Classification for Intelligent Information Retrieval


Ning Yu
School of Library and Information Science
Indiana University at Bloomington

Text Representation & Text Classification
for Intelligent Information Retrieval
Outline
• The big picture
• A specific problem – opinion detection
Intelligent information retrieval
• Characteristics
  ◦ Not restricted to keyword matching and Boolean search
  ◦ Deals with natural language queries and advanced search criteria
  ◦ Coarse-to-fine levels of granularity
  ◦ Automatically organizes/evaluates/interprets the solution space
  ◦ User-centered, e.g., adapts to the user’s learning habits
  ◦ Etc.
Intelligent information retrieval
• System Preferences
  ◦ Various sources of evidence
  ◦ Natural language processing
  ◦ Semantic web technologies
  ◦ Automatic text classification
  ◦ Etc.
Intelligent IR system diagram
A Specific Question:
Semi-Supervised Learning
for Identifying Opinions in Web Content
Dissertation work

Growing demand for online opinions

Enormous body of user-generated content

About anything, published
anywhere and at any time

Useful for literature review,
decision making, market
monitoring, etc.
Major approaches for opinion detection
What’s Essential?
Labeled data! And lots of it!

To acquire a broad and comprehensive collection of opinion-bearing
features (e.g., bag-of-words, POS words, N-grams (n>1), linguistic
collocations, stylistic features, contextual features);

To generate complex patterns (e.g., “good amount”) that can approximate
the context of words;

To generate and evaluate opinion detection systems;

To allow evaluation of opinion detection strategies with high confidence.
Challenges for opinion detection
• Shortage of opinion-labeled data: manual annotation is tedious, error-prone and difficult to scale up
• Domain transfer: strategies designed for opinion detection in one data domain generally do not perform well in another domain
Motivations & research question
• It is easy to collect unlabeled user-generated content that contains opinions
• Semi-Supervised Learning (SSL) requires only a limited amount of labeled data to automatically label unlabeled data, and it has achieved promising results in NLP studies
Is SSL effective in opinion detection both in sparse data
situations and for domain adaptation?
Datasets & data split
Dataset (sentences)    Blog Posts    Movie Reviews    News Articles
Opinion                4,843         5,000            5,297
Non-opinion            4,843         5,000            5,174

SSL:                                  Evaluation (5%), Labeled (1-5%), Unlabeled (90%)
Baseline Supervised Learning (SL):    Evaluation (5%), Labeled (1-5%)
Full SL:                              Evaluation (5%), Labeled (95%)
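The SSL split above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how sentences were assigned to each part, so the random shuffle, the `seed` parameter, and the function name `split_dataset` are hypothetical. Any sentences left over after the three parts are taken are simply unused here.

```python
import random

def split_dataset(sentences, labeled_fraction=0.01, seed=0):
    """Split sentences into evaluation (5%), labeled (1-5%), and unlabeled (90%) parts.

    Assumption: a plain random split; the slides do not state the procedure.
    """
    if not 0.01 <= labeled_fraction <= 0.05:
        raise ValueError("labeled_fraction is expected to lie between 1% and 5%")
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_eval = round(0.05 * n)
    n_labeled = round(labeled_fraction * n)
    n_unlabeled = round(0.90 * n)
    evaluation = items[:n_eval]
    labeled = items[n_eval:n_eval + n_labeled]
    unlabeled = items[n_eval + n_labeled:n_eval + n_labeled + n_unlabeled]
    return evaluation, labeled, unlabeled

# Example with 10,000 placeholder sentence ids and the largest labeled share (5%).
evaluation, labeled, unlabeled = split_dataset(range(10000), labeled_fraction=0.05)
```

For the baseline SL condition, the same labeled and evaluation parts would be used and the unlabeled part ignored.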
Two major SSL methods: Self-training
• Assumption: Highly confident predictions made by an initial opinion classifier are reliable and can be added to the labeled set.
• Limitation: Auto-labeled data may be biased by the particular opinion classifier.
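A minimal sketch of the self-training loop described above, in plain Python. Several details are assumptions rather than the study's settings: the tiny Naive Bayes scores only the words present in a sentence, and the confidence `threshold`, the `max_rounds` cap, and all function names are hypothetical.

```python
import math
from collections import Counter

def train(examples):
    """Fit a tiny Naive Bayes on (token_set, label) pairs with binary word presence."""
    doc_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in doc_counts}
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {word for counts in word_counts.values() for word in counts}
    return doc_counts, word_counts, vocab

def predict_proba(model, tokens):
    """Return {label: probability}; only words present in the sentence are scored."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)
        for word in tokens & vocab:
            # Add-one smoothed estimate of P(word present | label).
            score += math.log((word_counts[label][word] + 1) / (n_docs + 2))
        scores[label] = score
    top = max(scores.values())
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}

def self_train(labeled, unlabeled, threshold=0.75, max_rounds=10):
    """Repeatedly move confidently predicted sentences into the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        confident, remaining = [], []
        for tokens in pool:
            probs = predict_proba(model, tokens)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                confident.append((tokens, label))
            else:
                remaining.append(tokens)
        if not confident:  # nothing passed the threshold, so stop early
            break
        labeled.extend(confident)
        pool = remaining
    return train(labeled), labeled

seed_labeled = [({"i", "love", "this", "film"}, "opinion"),
                ({"the", "film", "opens", "friday"}, "non-opinion")]
seed_unlabeled = [{"i", "love", "it"}, {"it", "opens", "monday"}]
model, final_labeled = self_train(seed_labeled, seed_unlabeled)
```

Because each round retrains on its own earlier predictions, a mistake made with high confidence is never revisited, which is the bias the limitation bullet refers to.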
Two major SSL methods: Co-training
• Assumption: Two opinion classifiers with different strengths and weaknesses can benefit from each other.
• Limitation: It is not always easy to create two different classifiers.
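The co-training idea can be sketched as follows, again in plain Python. This is one possible reading, not the study's implementation: each sentence is given as two views (here, a unigram set and a bigram set), one small classifier is trained per view, and the label of whichever classifier is more confident is added to a shared labeled set. The combination rule, the `threshold`, and all names are assumptions.

```python
import math
from collections import Counter

def train(examples):
    """Fit a tiny Naive Bayes on (token_set, label) pairs with binary presence."""
    doc_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in doc_counts}
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {word for counts in word_counts.values() for word in counts}
    return doc_counts, word_counts, vocab

def predict(model, tokens):
    """Return (best_label, probability) for one view of a sentence."""
    doc_counts, word_counts, vocab = model
    total = sum(doc_counts.values())
    scores = {}
    for label, n_docs in doc_counts.items():
        score = math.log(n_docs / total)
        for word in tokens & vocab:
            score += math.log((word_counts[label][word] + 1) / (n_docs + 2))
        scores[label] = score
    top = max(scores.values())
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

def co_train(labeled, unlabeled, threshold=0.6, max_rounds=10):
    """labeled: [((view_a, view_b), label)]; unlabeled: [(view_a, view_b)]."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model_a = train([(views[0], y) for views, y in labeled])
        model_b = train([(views[1], y) for views, y in labeled])
        newly_labeled, remaining = [], []
        for views in pool:
            label_a, conf_a = predict(model_a, views[0])
            label_b, conf_b = predict(model_b, views[1])
            # Assumed rule: trust whichever classifier is more confident.
            label, conf = (label_a, conf_a) if conf_a >= conf_b else (label_b, conf_b)
            if conf >= threshold:
                newly_labeled.append((views, label))
            else:
                remaining.append(views)
        if not newly_labeled:
            break
        labeled.extend(newly_labeled)
        pool = remaining
    # Retrain both classifiers on the final, enlarged labeled set.
    model_a = train([(views[0], y) for views, y in labeled])
    model_b = train([(views[1], y) for views, y in labeled])
    return model_a, model_b, labeled

seed = [(({"great", "plot"}, {"great plot"}), "opinion"),
        (({"opens", "friday"}, {"opens friday"}), "non-opinion")]
pool = [({"great", "cast"}, {"great cast"}), ({"opens", "monday"}, {"opens monday"})]
model_a, model_b, final_labeled = co_train(seed, pool)
```

In this toy run the bigram view has seen neither new bigram, so the unigram classifier supplies both labels, a small illustration of one classifier covering for the other's blind spot.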
Experimental design
• General settings for SSL
  ◦ Naïve Bayes classifier for self-training
  ◦ Binary values for unigram and bigram features
• Co-training strategies:
  ◦ Unigrams and bigrams (content vs. context)
  ◦ Two randomly split feature/training sets
  ◦ A character-based language model (CLM) and a bag-of-words model (BOW)
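As a small illustration of binary unigram and bigram features, the sketch below records only whether an n-gram occurs in a sentence, not how often. The whitespace tokenizer, the lowercasing, and the function name `binary_ngram_features` are assumptions for the example; the slides do not describe the preprocessing that was used.

```python
def binary_ngram_features(sentence):
    """Return (unigrams, bigrams) as sets; set membership is the binary value 1."""
    tokens = sentence.lower().split()  # naive whitespace tokenizer (assumption)
    unigrams = set(tokens)
    bigrams = {f"{first} {second}" for first, second in zip(tokens, tokens[1:])}
    return unigrams, bigrams

# "good" appears twice but contributes a single binary unigram feature.
unigrams, bigrams = binary_ngram_features("A good amount of good acting")
```

Keeping the two sets separate makes it straightforward to hand unigrams to one classifier and bigrams to another, as in the first co-training strategy listed above.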
Results: Overall
• For movie reviews and news articles, co-training proved to be the most robust
• For blog posts, SSL showed no benefits over SL due to the low initial accuracy
Results: Movie reviews
• Both self-training and co-training can improve opinion detection performance
• Co-training is more effective than self-training
Results: Movie reviews (cont.)
• The more different the two classifiers, the better the performance
Results: Domain transfer
(movie reviews → blog posts)
• For a difficult domain (e.g., blog), simple self-training alone is promising for tackling the domain transfer problem.
Contributions
• Comprehensive research expands the spectrum of SSL application to opinion detection
• Investigation of SSL model that best fits the problem space extends understanding of opinion detection and provides a resource for knowledge-based representation
• Generation of guidelines and evaluation baselines advances later studies using SSL algorithms in opinion detection
• Research extensible to other data domains, non-English texts, and other text mining tasks
Thank you!
“All my opinions are
posted on my online blog.”
“If you want a second opinion,
I’ll ask my computer”
“A grade of 85 or higher will get
you favorable mention on my blog.”
www.CartoonStock.com