Text Representation & Text Classification for Intelligent Information
Ning Yu
School of Library and Information Science
Indiana University at Bloomington
Text Representation & Text Classification
for Intelligent Information Retrieval
Outline
The big picture
A specific problem – opinion detection
Intelligent information retrieval
Characteristics
Not restricted to keyword matching and Boolean search
Deal with natural language query and advanced search criteria
Coarse-to-fine level of granularity
Automatically organize/evaluate/interpret solution space
User-centered, e.g., adapt to user’s learning habit
Etc.
Intelligent information retrieval
System Preferences
Various sources of evidence
Natural language processing
Semantic web technologies
Automatic text classification
Etc.
Intelligent IR system diagram
A Specific Question:
Semi-Supervised Learning
for Identifying Opinions in Web Content
Dissertation work
Growing demand for online opinions
Enormous body of user-generated content
About anything, published anywhere and at any time
Useful for literature review, decision making, market monitoring, etc.
Major approaches for opinion detection
What’s Essential?
Labeled Data! And lots of it!!!
To acquire a broad and comprehensive collection of opinion-bearing features (e.g., bag-of-words, POS words, N-grams (n>1), linguistic collocations, stylistic features, contextual features);
To generate complex patterns (e.g., “good amount”) that can approximate the context of words;
To generate and evaluate opinion detection systems;
To allow evaluation of opinion detection strategies with high confidence.
Challenges for opinion detection
Shortage of opinion-labeled data: manual annotation is tedious, error-prone, and difficult to scale up
Domain transfer: strategies designed for opinion detection in one data domain generally do not perform well in another domain
Motivations & research question
Easy to collect unlabeled user-generated content that contains opinions
Semi-Supervised Learning (SSL) requires only a limited amount of labeled data to automatically label unlabeled data, and has achieved promising results in NLP studies
Is SSL effective in opinion detection both in sparse data situations and for domain adaptation?
Datasets & data split
Dataset (sentences)    Opinion    Non-opinion
Blog Posts             4,843      4,843
Movie Reviews          5,000      5,000
News Articles          5,297      5,174

SSL: Evaluation (5%), Labeled (1-5%), Unlabeled (90%)
Baseline Supervised Learning (SL): Evaluation (5%), Labeled (1-5%)
Full SL: Evaluation (5%), Labeled (95%)
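A minimal sketch of the SSL split described above, assuming sentences are shuffled once and cut into evaluation, labeled, and unlabeled portions. The function name `split_dataset`, its parameters, and the placeholder sentences are hypothetical and not taken from the slides.

```python
import random

def split_dataset(sentences, labeled_frac=0.05, eval_frac=0.05,
                  unlabeled_frac=0.90, seed=0):
    """Shuffle once, then cut into evaluation, labeled, and unlabeled parts."""
    if labeled_frac + eval_frac + unlabeled_frac > 1.0 + 1e-9:
        raise ValueError("fractions must not sum to more than 1")
    items = list(sentences)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split repeatable
    n = len(items)
    n_eval = round(n * eval_frac)
    n_lab = round(n * labeled_frac)
    n_unlab = round(n * unlabeled_frac)
    evaluation = items[:n_eval]
    labeled = items[n_eval:n_eval + n_lab]
    unlabeled = items[n_eval + n_lab:n_eval + n_lab + n_unlab]
    return evaluation, labeled, unlabeled

# Placeholder data standing in for one of the sentence collections.
data = [f"sentence {i}" for i in range(10000)]
evaluation, labeled, unlabeled = split_dataset(data, labeled_frac=0.01)
print(len(evaluation), len(labeled), len(unlabeled))
```

Varying `labeled_frac` between 0.01 and 0.05 corresponds to the 1-5% labeled settings; the baseline SL run would use the same labeled portion and simply ignore the unlabeled one.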
Two major SSL methods: Self-training
Assumption: Highly confident predictions made by an initial opinion classifier are reliable and can be added to the labeled set.
Limitation: Auto-labeled data may be biased by the particular opinion classifier.
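The self-training idea above can be sketched as follows. This is an illustrative toy, not the system used in the study: the small Naive Bayes model, the confidence `threshold`, the `max_rounds` limit, and the example sentences are all assumptions made for the sketch.

```python
import math
from collections import Counter

def features(sentence):
    """Binary unigram features: the set of lowercased tokens."""
    return set(sentence.lower().split())

def train_nb(examples):
    """Train a tiny Naive Bayes model from (sentence, label) pairs."""
    class_counts, feat_counts, vocab = Counter(), {}, set()
    for sentence, label in examples:
        class_counts[label] += 1
        feats = features(sentence)
        feat_counts.setdefault(label, Counter()).update(feats)
        vocab |= feats
    return class_counts, feat_counts, vocab

def predict_proba(model, sentence):
    """Return {label: posterior probability} using add-one smoothing."""
    class_counts, feat_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        denom = sum(feat_counts[label].values()) + len(vocab)
        score = math.log(count / total)
        for f in features(sentence) & vocab:  # unseen words are ignored
            score += math.log((feat_counts[label][f] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {l: math.exp(s - top) for l, s in log_scores.items()}
    norm = sum(exp.values())
    return {l: v / norm for l, v in exp.items()}

def self_train(labeled, unlabeled, threshold=0.9, max_rounds=10):
    """Repeatedly move confidently predicted sentences into the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb(labeled)
        confident, remaining = [], []
        for s in pool:
            probs = predict_proba(model, s)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                confident.append((s, label))
            else:
                remaining.append(s)
        if not confident:  # nothing new to learn from, so stop early
            break
        labeled.extend(confident)
        pool = remaining
    return train_nb(labeled), labeled

labeled = [
    ("i love this great movie", "opinion"),
    ("this film is terrible and boring", "opinion"),
    ("the movie opens on friday", "non-opinion"),
    ("the film runs two hours", "non-opinion"),
]
unlabeled = ["a great and boring story", "the show opens on monday"]
model, final_labeled = self_train(labeled, unlabeled, threshold=0.6)
print(len(final_labeled), predict_proba(model, "a great movie"))
```

The limitation noted above is visible in the loop: every auto-assigned label comes from the same model that is then retrained on it, so early mistakes can reinforce themselves.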
Two major SSL methods: Co-training
Assumption: Two opinion classifiers with different strengths and weaknesses can benefit from each other.
Limitation: It is not always easy to create two different classifiers.
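A simplified sketch of co-training, assuming the two classifiers differ only in their feature view (unigrams vs. bigrams) and that each classifier hands its confident predictions to the other's training set. The helper names, the `threshold`, the round limit, and the toy sentences are hypothetical; the study's actual classifiers and settings may differ.

```python
import math
from collections import Counter

def unigrams(s):
    return set(s.lower().split())

def bigrams(s):
    t = s.lower().split()
    return set(zip(t, t[1:]))

def train(examples, view):
    """Train a small Naive Bayes model over the features produced by `view`."""
    classes, feats, vocab = Counter(), {}, set()
    for s, y in examples:
        classes[y] += 1
        f = view(s)
        feats.setdefault(y, Counter()).update(f)
        vocab |= f
    return classes, feats, vocab, view

def predict(model, s):
    """Return (best_label, its posterior) with add-one smoothing."""
    classes, feats, vocab, view = model
    total = sum(classes.values())
    logp = {}
    for y, n in classes.items():
        denom = sum(feats[y].values()) + len(vocab)
        logp[y] = math.log(n / total) + sum(
            math.log((feats[y][f] + 1) / denom) for f in view(s) & vocab)
    top = max(logp.values())
    z = sum(math.exp(v - top) for v in logp.values())
    return max(logp, key=logp.get), 1.0 / z

def co_train(labeled, unlabeled, views=(unigrams, bigrams),
             threshold=0.8, rounds=5):
    """Two-view co-training: each model labels data for the other one."""
    train_sets = [list(labeled) for _ in views]
    pool = list(unlabeled)
    for _ in range(rounds):
        models = [train(ts, v) for ts, v in zip(train_sets, views)]
        moved = set()
        for i, model in enumerate(models):
            other = 1 - i  # assumes exactly two views
            for s in pool:
                label, p = predict(model, s)
                if p >= threshold and s not in moved:
                    train_sets[other].append((s, label))
                    moved.add(s)
        if not moved:
            break
        pool = [s for s in pool if s not in moved]
    return [train(ts, v) for ts, v in zip(train_sets, views)]

labeled = [
    ("i love this great movie", "opinion"),
    ("this film is terrible and boring", "opinion"),
    ("the movie opens on friday", "non-opinion"),
    ("the film runs two hours", "non-opinion"),
]
unlabeled = ["i love this boring film", "the movie opens on monday"]
models = co_train(labeled, unlabeled, threshold=0.6)
print([predict(m, "i love this film") for m in models])
```

Because each model is trained on labels produced by the other view, the two views need to make different kinds of errors for the exchange to help, which is the difficulty named in the limitation above.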
Experimental design
General settings for SSL
Naïve Bayes classifier for self-training
Binary values for unigram and bigram features
Co-training strategies:
Unigrams and bigrams (content vs. context)
Two randomly split feature/training sets
A character-based language model (CLM) and a bag-of-words model (BOW)
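To make "binary values for unigram and bigram features" concrete, here is a small sketch. It assumes simple whitespace tokenization and lowercasing, which the slides do not specify; the function name `binary_ngram_features` is hypothetical.

```python
def binary_ngram_features(sentence, n_values=(1, 2)):
    """Map each n-gram that occurs in the sentence to 1 (presence, not count)."""
    tokens = sentence.lower().split()
    feats = {}
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] = 1
    return feats

feats = binary_ngram_features("good amount of good acting")
print(sorted(feats))
```

Note that "good" appears twice in the sentence but still gets the value 1, and that the bigram "good amount" is kept as its own feature alongside the unigram "good".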
Results: Overall
For movie reviews and news articles, co-training proved to be the most robust
For blog posts, SSL showed no benefits over SL due to the low initial accuracy
Results: Movie reviews
Both self-training and co-training can improve opinion detection performance
Co-training is more effective than self-training
Results: Movie reviews (cont.)
The more different the two classifiers, the better the performance
Results: Domain transfer
(movie reviews->blog posts)
For a difficult domain (e.g., blog), simple self-training alone is
promising for tackling the domain transfer problem.
Contributions
Comprehensive research expands the spectrum of SSL application to opinion detection
Investigation of SSL model that best fits the problem space extends understanding of opinion detection and provides a resource for knowledge-based representation
Generation of guidelines and evaluation baselines advances later studies using SSL algorithms in opinion detection
Research extensible to other data domains, non-English texts, and other text mining tasks
Thank you!
“All my opinions are
posted on my online blog.”
“If you want a second opinion,
I’ll ask my computer”
“A grade of 85 or higher will get
you favorable mention on my blog.”
www.CartoonStock.com