CS 277: Data Mining
Text Classification
Padhraic Smyth
Department of Computer Science
University of California, Irvine
Some slides adapted from Introduction to Information Retrieval, by
Manning, Raghavan, and Schutze, Cambridge University Press, 2009,
with additional slides from Ray Mooney.
Progress Report
• New deadline
– In class, Thursday February 18th (not Tuesday)
• Outline
– 3 pages maximum
– Suggested content
• Brief restatement of the problem you are addressing (no need to repeat everything in
your original proposal), e.g., ½ a page
• Summary of progress so far
– Background papers read
– Data preprocessing, exploratory data analysis
– Algorithms/Software reviewed, implemented, tested
– Initial results (if any)
• Challenges and difficulties encountered
• Brief outline of plans between now and end of quarter
– Use diagrams, figures, tables, where possible
– Write clearly: check what you write
Currently in the News….
• Communications of the ACM, Feb 2010
– Fun short article, “Where the Data is”
• http://cacm.acm.org/magazines/2010/2/69384-where-the-data-is/fulltext
– Technical article on Fast Dimension Reduction
• Introduction:
http://cacm.acm.org/magazines/2010/2/69382-strange-effects-in-high-dimension/fulltext
• Article:
http://cacm.acm.org/magazines/2010/2/69359-faster-dimension-reduction/fulltext
• New York Times, Feb 8, 2010
– Interesting study of which articles are emailed most often:
http://www.nytimes.com/2010/02/09/science/09tier.html?ref=technology
Conferences related to Text Mining
• ACM SIGIR Conference
• WWW Conference
• ACM SIGKDD Conference
Worth looking at papers in recent years for many different ideas
related to text mining
Further Reading on Text Classification
• General references on text and language modeling
– Introduction to Information Retrieval, C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press, 2008
– Foundations of Statistical Language Processing, C. Manning and H. Schutze, MIT Press, 1999
– Speech and Language Processing: An Introduction to Natural Language Processing, Dan Jurafsky and James Martin, Prentice Hall, 2000
– The Text Mining Handbook: Advanced Approaches to Analyzing Unstructured Data, Feldman and Sanger, Cambridge University Press, 2007
• Web-related text mining in general
– S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003
– See Chapter 5 for a discussion of text classification
• SVMs for text classification
– T. Joachims, Learning to Classify Text using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002
Text Classification
• Text classification has many applications
– Spam email detection
– Automated tagging of streams of news articles, e.g., Google News
– Online advertising: what is this Web page about?
• Data Representation
– "Bag of words" most commonly used: either counts or binary
– Can also use "phrases" (e.g., bigrams) for commonly occurring combinations of words
• Classification Methods
– Naïve Bayes widely used (e.g., for spam email)
• Fast and reasonably accurate
– Support vector machines (SVMs)
• Typically the most accurate method in research studies
• But more computationally complex
– Logistic regression (regularized)
• Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2001)
Ch. 13
Types of Labels/Categories/Classes
• Assigning labels to documents or web-pages
– Labels are most often topics, such as Yahoo! categories
– "finance," "sports," "news>world>asia>business"
• Labels may be genres
– "editorials", "movie-reviews", "news"
• Labels may be opinions on a person/product
– "like", "hate", "neutral"
• Labels may be domain-specific
– "interesting-to-me" : "not-interesting-to-me"
– "contains adult language" : "doesn't"
– language identification: English, French, Chinese, …
– search vertical: about Linux versus not
– "link spam" : "not link spam"
Common Data Sets used for Evaluation
• Reuters
– 10,700 labeled documents
– 10% of documents have multiple class labels
• Yahoo! Science Hierarchy
– 95 disjoint classes with 13,598 pages
• 20 Newsgroups data
– 18,800 labeled USENET postings
– 20 leaf classes, 5 root-level classes
• WebKB
– 8,300 documents in 7 categories such as "faculty", "course", "student"
• Industry
– 6,449 home pages of companies partitioned into 71 classes
Practical Issues
• Tokenization
– Convert document to word counts = "bag of words"
– word token = "any nonempty sequence of characters"
– for HTML (etc.) need to remove formatting
• Canonical forms, Stopwords, Stemming
– Remove capitalization
– Stopwords:
• remove very frequent words (a, the, and, …) – can use a standard list
• can also remove very rare words, e.g., words that occur in k or fewer documents, e.g., k = 5
– Stemming (next slide)
• Data representation (see the sketch below)
– e.g., sparse 3-column format for bag of words: <docid termid count>
– can use inverted indices, etc.
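A minimal Python sketch of this pipeline (the tokenizer, stopword list, and example documents are illustrative, and a term string stands in for the integer termid):

    from collections import Counter
    import re

    STOPWORDS = {"a", "the", "and", "of", "to", "in"}   # toy stopword list

    def tokenize(text):
        # lowercase, keep alphabetic runs only, drop stopwords
        return [t for t in re.findall(r"[a-z]+", text.lower())
                if t not in STOPWORDS]

    def bag_of_words(docs):
        # emit sparse <docid, term, count> triples, one per distinct term
        return [(docid, term, count)
                for docid, text in enumerate(docs)
                for term, count in Counter(tokenize(text)).items()]

    docs = ["The classifier bound the complexity.",
            "Regression, regression and the classifier."]
    print(bag_of_words(docs))
    # [(0, 'classifier', 1), (0, 'bound', 1), (0, 'complexity', 1),
    #  (1, 'regression', 2), (1, 'classifier', 1)]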
Toy example of a document-term matrix
        database   SQL   index   regression   likelihood   linear
d1          24      21      9        0            0           3
d2          32      10      5        0            3           0
d3          12      16      5        0            0           0
d4           6       7      2        0            0           0
d5          43      31     20        0            3           0
d6           2       0      0       18            7          16
d7           0       0      1       32           12           0
d8           3       0      0       22            4           2
d9           1       0      0       34           27          25
d10          6       0      0       17            4          23
Stemming
• Want to reduce all morphological variants of a word to a single
index term
– e.g. a document containing words like fish and fisher may not be retrieved by a
query containing fishing (no fishing explicitly contained in the document)
• Stemming - reduce words to their root form
• e.g. fish – becomes a new index term
• Porter stemming algorithm (1980)
– relies on a preconstructed suffix list with associated rules
• e.g. if suffix=IZATION and prefix contains at least one vowel followed by a consonant,
replace with suffix=IZE
– BINARIZATION => BINARIZE
• Not always desirable: e.g., {university, universal} -> univers (in Porter's)
• WordNet: dictionary-based approach
• May be more useful for information retrieval/search than for classification
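For reference, the Porter stemmer is available in NLTK; a quick sketch (assumes nltk is installed):

    from nltk.stem import PorterStemmer   # pip install nltk

    stemmer = PorterStemmer()
    for word in ["fish", "fishing", "fisher",
                 "binarization", "university", "universal"]:
        print(word, "->", stemmer.stem(word))
    # "fishing" maps to "fish"; "university" and "universal" both
    # collapse to "univers", illustrating the over-stemming noted above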
Confusion Matrix and Accuracy
                     True      True      True
                     Class 1   Class 2   Class 3
Predicted Class 1      80        60        10
Predicted Class 2      10        90         0
Predicted Class 3      10        30       110

Accuracy = fraction of examples classified correctly
         = (80 + 90 + 110)/400
         = 280/400
         = 70%
Precision
Precision for class k
  = fraction predicted as class k that belong to class k
  = 90/100
  = 90% for class 2 in the table below

                     True      True      True
                     Class 1   Class 2   Class 3
Predicted Class 1      80        60        10
Predicted Class 2      10        90         0
Predicted Class 3      10        30       110
Recall
Recall for class k
  = fraction that belong to class k predicted as class k
  = 90/180
  = 50% for class 2 in the table below

                     True      True      True
                     Class 1   Class 2   Class 3
Predicted Class 1      80        60        10
Predicted Class 2      10        90         0
Predicted Class 3      10        30       110
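All three quantities can be read directly off the confusion matrix; a short numpy check using the table above (rows = predicted class, columns = true class):

    import numpy as np

    cm = np.array([[80, 60,  10],    # rows: predicted class 1..3
                   [10, 90,   0],    # cols: true class 1..3
                   [10, 30, 110]])

    accuracy  = np.trace(cm) / cm.sum()        # 280/400 = 0.70
    precision = np.diag(cm) / cm.sum(axis=1)   # correct / predicted-as-k
    recall    = np.diag(cm) / cm.sum(axis=0)   # correct / truly-k
    print(accuracy, precision[1], recall[1])   # 0.7  0.9  0.5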
Precision-Recall Curves
• Rank our test documents by predicted probability of belonging to class k
– Sort the list by this score
– For t = 1, …, N (N = total number of test documents)
• Select the top t documents on the list
• Classify documents above the cutoff as class k, documents below as not class k
• Compute a precision-recall pair
– This generates a set of precision-recall values we can plot
– Perfect performance: precision = 1, recall = 1
– Similar concept to an ROC curve (see the sketch below)
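A from-scratch sketch of this procedure (the scores and labels are illustrative):

    import numpy as np

    def pr_curve(scores, labels):
        # scores: predicted p(class k | doc); labels: 1 if doc truly in class k
        order = np.argsort(-np.asarray(scores))        # best-scored docs first
        labels = np.asarray(labels)[order]
        tp = np.cumsum(labels)                         # true positives in top t
        t = np.arange(1, len(labels) + 1)
        return tp / t, tp / labels.sum()               # precision(t), recall(t)

    p, r = pr_curve([0.9, 0.8, 0.6, 0.4, 0.2], [1, 0, 1, 1, 0])
    print(list(zip(np.round(r, 2), np.round(p, 2))))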
Precision-recall for category: Crude
[Figure: precision-recall curves for the category "Crude", comparing LSVM, Decision Tree, Naïve Bayes, and Rocchio; from Dumais (1998)]
F-measure
• F-measure = weighted harmonic mean of precision and recall
(at some fixed operating threshold for the classifier)
F1 = 1/ ( 0.5/P + 0.5/R )
= 2PR / (P + R)
Useful as a single summary measure of performance
Sometimes the "breakeven" F1 is reported, i.e., F1 measured at the operating point where P = R
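As a quick check of the identity (values are illustrative):

    def f1(p, r):
        # harmonic mean: 1 / (0.5/p + 0.5/r) == 2*p*r / (p + r)
        return 2 * p * r / (p + r)

    print(f1(0.9, 0.5))   # 0.643...; at breakeven, f1(0.8, 0.8) == 0.8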
Sec.13.6
F-Measures on Reuters Data Set
Probabilistic “Generative” Classifiers
• Model p( x | ck ) for each class and perform classification via Bayes' rule:
c = arg max_k { p( ck | x ) } = arg max_k { p( x | ck ) p( ck ) }
• How to model p( x | ck )?
– p( x | ck ) = probability of a “bag of words” x given a class ck
– Two commonly used approaches (for text):
• Naïve Bayes: treat each term xj as being conditionally independent, given ck
• Multinomial: model a document with N words as N tosses of a p-sided die
– Other models possible but less common,
• E.g., model word order by using a Markov chain for p( x | ck )
Naïve Bayes Classifier for Text
• Naïve Bayes classifier = conditional independence model
– Assumes conditional independence of the terms given the class:
    p( x | ck ) = ∏_j p( xj | ck )
– Note that we model each term xj as a discrete random variable
– Binary terms (Bernoulli):
    p( x | ck ) = ∏_j p( xj | ck ),  with xj ∈ {0, 1}
– Non-binary terms (counts) (far less common):
    p( x | ck ) = ∏_j p( xj = nj | ck ),  where nj is the count of term j
– Can use a parametric model (e.g., Poisson) or a non-parametric model
(e.g., histogram) for the p( xj = nj | ck ) distributions
Simple Example: Naïve Bayes (binary)
• Vocabulary = {complexity, bound, regression, classifier}
• Two classes: class 1 = algorithms, class 2 = data mining
• Naïve Bayes model:
– Probability that a word appears in a document, given the document class
– Assumes that each word appears independently of the others
p( word appears in document | class ):

               complexity   bound   regression   classifier
Algorithms        0.9        0.7       0.1          0.1
Data Mining       0.2        0.1       0.7          0.8
Spam Email Classification
• Naïve Bayes widely used in spam filtering
– See Paul Graham's "A Plan for Spam" (on the Web)
– A naive-Bayes-like classifier with unusual parameter estimation
• Naive Bayes works well when appropriately used
– Very easy to update the model: just update the counts
• But real spam filters also rely on many other aspects
– black lists, HTML features, etc.
Multinomial Classifier for Text
• Multinomial classification model
– Assume that the data are generated by a p-sided die (multinomial model):

    p( x | ck ) ∝ ∏_j p( xj | ck )^nj

– where the product is over all terms in the vocabulary, and
  nj = number of times term j occurs in the document
– Here we have a single random variable (die) per class, and the term
probabilities p( xj | ck ) sum to 1 over the vocabulary (i.e., a multinomial model)
– Note that, as with naïve Bayes, highly correlated words cannot be modeled well
• e.g., two words that should almost always appear together in documents of a given class
– Probabilities are typically only defined and evaluated for non-zero counts
• But "zero counts" could also be modeled if desired
• This would be equivalent to a naïve Bayes model with a geometric distribution on counts
Simple Example: Multinomial
• Vocabulary = {complexity, bound, regression, classifier}
• Two classes: class 1 = algorithms, class 2 = data mining
• Multinomial model:
– Probability of next word in a document, given the document class
– Assumes that words are “memoryless”
p( next word | class ):

               complexity   bound   regression   classifier
Algorithms        0.5        0.4       0.05         0.05
Data Mining       0.1        0.05      0.3          0.55
Highest Probability Terms in Multinomial Distributions
Prediction
• Naïve Bayes
– log [ p( x | c ) p(c) ] = log [ ∏_j p( xj | c ) ] + log p(c)
                          = Σ_j log p( xj | c ) + log p(c)
(sum over all words in the vocabulary)
• Multinomial
– log [ p( x | c ) p(c) ] = log [ ∏_j p( xj | c )^nj ] + log p(c)
                          = Σ_j nj log p( xj | c ) + log p(c)
(sum over all words that occur in the document)
Classification Example
• Document 1 (bag of words)

    Counts:    complexity   bound   regression   classifier
                   4          0         0            1

• Document 2 (bag of words)

    Counts:    complexity   bound   regression   classifier
                   2          0         1            3

• Compute log p( class | document ) under both the naïve Bayes and the
multinomial model, for each document
– Can assume p(class 1) = p(class 2)
Classification Example (for document 1)

Naïve Bayes: log p( word | class )

                  complexity   bound   regression   classifier
A (Algorithms)      -0.11      -0.36     -2.3         -2.3
DM (Data Mining)    -1.6       -2.3      -0.36        -0.22

Multinomial: log p( word | class )

                  complexity   bound   regression   classifier
A (Algorithms)      -0.69      -0.92     -3.0         -3.0
DM (Data Mining)    -2.3       -3.0      -1.2         -0.6

D1 counts:        complexity   bound   regression   classifier
                      4          0         0            1

Naïve Bayes: Σ log p( xj | c ) + log p(c)
(summing the terms for the words present, complexity and classifier)
  Class A:  score = -0.11 - 2.3 = -2.41
  Class DM: score = -1.6 - 0.22 = -1.82

Multinomial: Σ nj log p( xj | c ) + log p(c)
  Class A:  score = 4 × (-0.69) - 3.0 ≈ -5.8
  Class DM: score = 4 × (-2.3) - 0.6 ≈ -9.8

Note that the two models disagree on this document: naïve Bayes scores DM
higher, while the multinomial scores A higher.
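The four scores can be verified with a few lines of Python, using the toy probability tables from the earlier slides (the log p(c) terms drop out because the priors are assumed equal; small differences from the slide's numbers are rounding):

    import math

    p_nb   = {"A": [0.9, 0.7, 0.1, 0.1],    # p(word appears | class)
              "DM": [0.2, 0.1, 0.7, 0.8]}
    p_mult = {"A": [0.5, 0.4, 0.05, 0.05],  # p(next word | class)
              "DM": [0.1, 0.05, 0.3, 0.55]}
    doc1 = [4, 0, 0, 1]   # complexity, bound, regression, classifier

    for c in ("A", "DM"):
        nb   = sum(math.log(p) for p, n in zip(p_nb[c], doc1) if n > 0)
        mult = sum(n * math.log(p) for p, n in zip(p_mult[c], doc1))
        print(c, round(nb, 2), round(mult, 1))
    # A  -2.41 -5.8   (multinomial prefers A)
    # DM -1.83 -9.8   (naive Bayes prefers DM)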
Comparing Naïve Bayes and Multinomial models
McCallum and Nigam (1998) found that the multinomial model outperformed
naïve Bayes (with binary features) in text classification experiments.

Note on names used in the literature:
- "Bernoulli" (or "multivariate Bernoulli") is sometimes used for the binary
version of the naïve Bayes model
- the multinomial model is also referred to as the "unigram" model
- the multinomial model is also sometimes (confusingly) referred to as naïve Bayes
WebKB Data Set
• Classify webpages from CS departments into:
– student, faculty, course, project
• Train on ~5,000 hand-labeled web pages
– Cornell, Washington, U. Texas, Wisconsin
• Crawl and classify a new site (CMU)
• Results with naïve Bayes:

              Student   Faculty   Person   Project   Course   Department
Extracted       180        66       246       99        28         1
Correct         130        28       194       72        25         1
Accuracy        72%        42%      79%       73%       89%       100%
Comparing Naïve Bayes and Multinomial on Web KB Data
Sec. 15.2.4
Classic Reuters-21578 Data Set
• Most (over)used data set
• 21,578 documents
• 9,603 training, 3,299 test articles (ModApte/Lewis split)
• 118 categories
– An article can be in more than one category
– Learn 118 binary category distinctions
• Average document: about 90 types, 200 tokens
• Average number of classes assigned
– 1.24 for docs with at least one category
• Only about 10 out of 118 categories are large

Common categories (#train, #test):
  Earn (2877, 1087)          Trade (369, 119)
  Acquisitions (1650, 179)   Interest (347, 131)
  Money-fx (538, 179)        Ship (197, 89)
  Grain (433, 149)           Wheat (212, 71)
  Crude (389, 189)           Corn (182, 56)
Sec. 15.2.4
Example of Reuters document
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981"
NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off
tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining
industry positions on a number of issues, according to the National Pork Producers Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the
future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate
whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC
said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the
industry, the NPPC added. Reuter
&#3;</BODY></TEXT></REUTERS>
Comparing Multinomial and Naïve Bayes on Reuters Data
(from McCallum and Nigam, 1998)
Comparing Naïve Bayes and Multinomial
(slide from Chris Manning, Stanford)
Results from classifying 13,589 Yahoo! Web pages in Science subtree
of hierarchy into 95 different topics
Feature Selection
• The performance of text classification algorithms can be optimized by
selecting only a subset of the most discriminative terms
– Particularly important for naïve Bayes (see previous slides)
– The multinomial classifier is less sensitive
• Greedy search
– Start from the empty set or the full set and add/delete one term at a time
– Heuristics for adding/deleting:
• Information gain (mutual information of term with class)
• Chi-square
• Other ideas
• Methods tend not to be particularly sensitive to the specific heuristic
used for feature selection, but some form of feature selection often
improves performance
Sec.13.5.1
Feature selection via Mutual Information
• In the training set, choose the k words which best discriminate (give the
most information about) the categories.
• The mutual information between a word w and a class c is

    I(w, c) = Σ_{ew ∈ {0,1}} Σ_{ec ∈ {0,1}} p(ew, ec) log [ p(ew | ec) / p(ew) ]

computed for each word w and each category c, where ew and ec are binary
indicators and
    p(ew | ec) = the probability that w appears in a document, given
                 whether or not the document belongs to class c
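A from-scratch sketch of this computation, using the equivalent symmetric form log[ p(ew, ec) / (p(ew) p(ec)) ] and add-one smoothing to avoid log(0) on small samples (the documents and labels below are illustrative):

    import math
    from collections import Counter

    def mutual_information(docs, labels, word, cls):
        # docs: token lists; labels: class names; returns I(w, c)
        n = len(docs)
        counts = Counter((int(word in toks), int(lab == cls))
                         for toks, lab in zip(docs, labels))
        mi = 0.0
        for ew in (0, 1):
            for ec in (0, 1):
                p_joint = (counts[(ew, ec)] + 1) / (n + 4)
                p_ew = (counts[(ew, 0)] + counts[(ew, 1)] + 2) / (n + 4)
                p_ec = (counts[(0, ec)] + counts[(1, ec)] + 2) / (n + 4)
                mi += p_joint * math.log(p_joint / (p_ew * p_ec))
        return mi

    docs = [["car", "engine"], ["voltage", "circuit"], ["car", "dealer"]]
    labels = ["autos", "electronics", "autos"]
    print(mutual_information(docs, labels, "car", "autos"))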
Sec.13.5.1
Example of Feature Selection
• For each category we build a list of the k most discriminating terms
– Use the union of these terms in our classifier
– k is selected heuristically, or via cross-validation
• For example (on 20 Newsgroups), using mutual information
– sci.electronics: circuit, voltage, amp, ground, copy, battery,
electronics, cooling, …
– rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision,
autos, tires, toyota, …
• Greedy: does not account for correlations between terms
Example of Feature Selection
(from Chakrabarti, Chapter 5)
9600 documents from US Patent database
20,000 raw features (terms)
Comments on Generative Models for Text
(Comments applicable to both Naïve Bayes and Multinomial classifiers)
• Simple and fast => popular in practice
– e.g., linear in p, n, M for both training and prediction
– Training = "smoothed" frequency counts, e.g., estimates of the form

    p( ck | xj = 1 ) = ( nk,j + αk ) / ( nk + Σm αm )

where the n's are training-set counts and the α's are smoothing pseudocounts
– e.g., easy to use in situations where the classifier needs to be updated
regularly (e.g., for spam email)
• Numerical issues
– Typically work with log p( ck | x ), etc., to avoid numerical underflow
– Useful trick for the naïve Bayes classifier (sketched below):
• when computing Σ_j log p( xj | ck ) over sparse data, it may be much faster to
– precompute Σ_j log p( xj = 0 | ck ) over the full vocabulary
– and then, for each word j that does occur, subtract log p( xj = 0 | ck )
and add log p( xj = 1 | ck )
• Note: both models are "wrong", but for classification they are often sufficient
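A numpy sketch of that trick (the parameters here are random placeholders; a real classifier would use the smoothed estimates above):

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.uniform(0.01, 0.5, size=(2, 10000))   # toy p(xj = 1 | ck)
    log_p1, log_p0 = np.log(p), np.log1p(-p)

    base = log_p0.sum(axis=1)   # precomputed: score if every word were absent

    def nb_scores(present):
        # correct the base score only for the words that actually occur
        return base + (log_p1[:, present] - log_p0[:, present]).sum(axis=1)

    print(nb_scores(np.array([3, 17, 250])))   # touches 3 columns, not 10,000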
Beyond Independence
• Naïve Bayes and the multinomial model both assume conditional independence
of words given the class
• Alternative approaches try to account for higher-order dependencies
– Bayesian networks:
• p( x | c ) = ∏_j p( xj | parents(xj), c )
• Equivalent to a directed graph where edges represent direct dependencies
• Various algorithms search for a good network structure
• Useful for improving the quality of the distribution model
• …however, this does not always translate into better classification
– Maximum entropy models:
• p( x | c ) = (1/Z) ∏_S f( xS | c ), a product of functions over subsets S of terms
• Equivalent to an undirected graphical model
• Estimation is equivalent to a maximum-entropy assumption
• Feature selection (which f terms to include) is crucial –
can provide high-accuracy classification
• …however, tends to be computationally complex to fit (estimating Z is difficult)
Dumais et al. 1998: Reuters - Accuracy
              NBayes   Trees   LinearSVM
earn           95.9%   97.8%     98.2%
acq            87.8%   89.7%     92.8%
money-fx       56.6%   66.2%     74.0%
grain          78.8%   85.0%     92.4%
crude          79.5%   85.0%     88.3%
trade          63.9%   72.5%     73.5%
interest       64.9%   67.1%     76.3%
ship           85.4%   74.2%     78.0%
wheat          69.7%   92.5%     89.7%
corn           65.3%   91.8%     91.1%
Avg Top 10     81.5%   88.4%     91.4%
Avg All Cat    75.2%    n/a      86.4%
Sec. 15.2.4
Precision-recall for category: Ship
[Figure: precision-recall curves for the category "Ship", comparing LSVM, Decision Tree, Naïve Bayes, and Rocchio; from Dumais (1998)]
Sec. 15.2.4
Micro- vs. Macro-Averaging
• If we have more than one class, how do we combine multiple performance
measures into one quantity?
• Macroaveraging: compute performance for each class, then average.
• Microaveraging: collect decisions for all classes into a single
contingency table, then evaluate.
• Note that macroaveraging ignores how likely each class is: it gives the
performance on every class equal weight, whereas microaveraging weights
classes according to how likely they are.
Sec. 15.2.4
Micro- vs. Macro-Averaging: Example
Class 1:
                    Truth: yes   Truth: no
  Classifier: yes       10           10
  Classifier: no        10          970

Class 2:
                    Truth: yes   Truth: no
  Classifier: yes       90           10
  Classifier: no        10          890

Micro Ave. Table (element-wise sum of the two):
                    Truth: yes   Truth: no
  Classifier: yes      100           20
  Classifier: no        20         1860

• Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
• Microaveraged precision: 100/120 ≈ 0.83
• The microaveraged score is dominated by the score on the common classes
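The same numbers in a few lines of numpy (tables as above, rows = classifier decision, columns = truth):

    import numpy as np

    class1 = np.array([[10, 10], [10, 970]])
    class2 = np.array([[90, 10], [10, 890]])

    def precision(t):
        return t[0, 0] / t[0].sum()   # true positives / predicted positives

    macro = (precision(class1) + precision(class2)) / 2   # 0.70
    micro = precision(class1 + class2)                    # 100/120 = 0.83
    print(macro, micro)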
Sec. 15.2.4
Results on Reuters Data
Sec. 15.2.4
SVM vs. Other Methods
(Yang and Liu, 1999)
Comparison of accuracy across three classifiers: Naive Bayes, Maximum Entropy and Linear
SVM, using three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory,
and University Web pages from WebKB. From Chakrabarti, 2003, Chapter 5.
Sec. 15.3.2
Upweighting
• You can get a lot of value by differentially weighting contributions from
different document zones
– e.g., count a word as two instances when it appears in, say, the abstract
– Upweighting title words helps (Cohen & Singer, 1996)
• Doubling the weight on title words is a good rule of thumb
– Upweighting the first sentence of each paragraph helps (Murata, 1999)
– Upweighting sentences that contain title words helps (Ko et al., 2002)
• Are there principled approaches to learning where and how much to upweight
for a specific text classification task and corpus? (A simple sketch of zone
upweighting follows below.)
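A minimal sketch of zone upweighting under the doubling rule of thumb (tokenization kept deliberately trivial):

    from collections import Counter

    def zone_weighted_counts(title, body, title_weight=2):
        # each title token contributes title_weight counts instead of one
        counts = Counter(body.lower().split())
        for token in title.lower().split():
            counts[token] += title_weight
        return counts

    print(zone_weighted_counts("Support Vector Machines",
                               "support vector machines for text data"))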
Sec. 15.3.1
Accuracy as a function of data size
• With enough data, the choice of classifier may not matter much, and the
best choice may be unclear
– Data: Banko and Brill on "confusion-set" disambiguation, e.g.,
principle vs. principal
– "Scaling to very very large corpora for natural language disambiguation",
Banko and Brill, Annual Meeting of the ACL, 2001
• But the fact that you have to keep doubling your data to improve
performance is a little unpleasant
Learning with Labeled and Unlabeled documents
• In practice, obtaining labels for documents is time-consuming, expensive,
and error-prone
– Typical application: a small number of labeled docs and a very large
number of unlabeled docs
• Idea (sketched in code below):
– Build a probabilistic model on the labeled docs
– Classify the unlabeled docs, getting p( class | doc ) for each class and doc
• This is equivalent to the E-step of the EM algorithm
– Now relearn the probabilistic model using the new "soft labels"
• This is equivalent to the M-step of the EM algorithm
– Continue to iterate until convergence (e.g., class probabilities no longer
change significantly)
– This EM approach shows that unlabeled data can improve classification
performance, compared to using labeled data alone
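A minimal numpy sketch of this EM loop for a multinomial naïve Bayes model (the names and fixed iteration count are illustrative; a real implementation would monitor the log-likelihood for convergence):

    import numpy as np

    def em_nb(X_lab, y_lab, X_unlab, n_classes, n_iter=20, alpha=1.0):
        # X_*: document-term count matrices; y_lab: labels for labeled docs
        R_lab = np.eye(n_classes)[y_lab]            # hard labels, kept fixed
        R_unlab = np.full((X_unlab.shape[0], n_classes), 1.0 / n_classes)
        X = np.vstack([X_lab, X_unlab])
        for _ in range(n_iter):
            R = np.vstack([R_lab, R_unlab])
            # M-step: smoothed priors and term probabilities from soft counts
            prior = R.sum(axis=0) / R.shape[0]
            tc = R.T @ X + alpha
            theta = tc / tc.sum(axis=1, keepdims=True)
            # E-step: recompute p(class | doc) for the unlabeled docs only
            logp = np.log(prior) + X_unlab @ np.log(theta).T
            logp -= logp.max(axis=1, keepdims=True)   # numerical stability
            R_unlab = np.exp(logp)
            R_unlab /= R_unlab.sum(axis=1, keepdims=True)
        return prior, theta, R_unlab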
Learning with Labeled and Unlabeled Data
Graph from “Semi-supervised text classification using EM”, Nigam, McCallum, and Mitchell, 2006
MultiLabel Text Classification
• Each document can have 1, 2, 3, … class labels
– e.g., news articles annotated with "tags"
– Wikipedia categories
– MeSH headings for biomedical articles
• Common approach (see the sketch below)
– K labels
– Train K different "one versus all" classifiers, e.g., K binary SVMs
– Limitations
• Does not account for label dependencies (e.g., positive and negative
correlations in label patterns)
• Can have trouble with (a) many labels per document, and (b) few documents per label
– Alternatives
• Train K(K-1)/2 binary classifiers – not good computationally
• Capture label correlations in a lower-dimensional space
– "compressed sensing" – Hsu, Kakade, Langford, NIPS 2009
– topic models – new research at UCI, to be discussed later in the quarter
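The common one-versus-all approach is a short script in scikit-learn; a sketch (the documents and labels are toy data):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    docs = ["stock markets fell", "new linux kernel released",
            "markets react to linux news"]
    labels = [["finance"], ["tech"], ["finance", "tech"]]   # 1+ labels per doc

    Y = MultiLabelBinarizer().fit_transform(labels)   # one binary column per label
    X = TfidfVectorizer().fit_transform(docs)
    clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)  # K independent binary SVMs
    print(clf.predict(X))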
Other aspects of text classification
• Real-time constraints
– Being able to update classifiers as new data arrive
– Being able to make predictions very quickly in real time
• Document length
– Varying document length can be a problem for some classifiers
– The multinomial tends to be less sensitive than the Bernoulli model, for example
• Linked documents
– Can view Web documents as nodes in a directed graph
– Classification can then leverage the link structure
• Heuristic: class labels of linked pages are more likely to be the same
– The optimal solution is to classify all documents jointly rather than individually
– The resulting "global classification" problem is typically computationally complex
Summary of Text Classifiers
• Naïve Bayes models (Bernoulli or multinomial)
– Low time complexity (a single linear pass through the data)
– Generally good, but not always the best
– Widely used for spam email filtering
• Linear SVMs
– Often produce the best results in research studies
– But computationally complex to train
– Not as widely used in practice as naïve Bayes
• Logistic regression
– Can be as accurate as SVMs, and computationally cheaper
– A variant of this is sometimes referred to as "maximum entropy"
– Adaptations for sparse text data:
• Feature selection or regularization (e.g., L1/lasso) is essential
• Stochastic gradient descent can be very useful (see earlier lecture)
Ch. 13
Alternatives: Manual Classification
• Manual classification
– Used by the original Yahoo! Directory
– LookSmart, About.com, ODP, PubMed
– Very accurate when the job is done by experts
– Consistent when the problem size and team are small
– Difficult and expensive to scale
– Can be noisy when done by different people
• e.g., spam/non-spam categorization by users
• Note that this is how our training data is created for classification
Ch. 13
Rule-Based Systems
Hand-coded rule-based systems
• One technique used by spam filters, Reuters, the CIA, etc.
– Widely used in government and industry
• Companies provide "IDEs" for writing such rules
• e.g., assign a category if the document contains a given Boolean
combination of words (see the sketch below)
• Standing queries: commercial systems have complex query languages
(everything in IR query languages + score accumulators)
• Accuracy is often very high if a rule has been carefully refined over time
by a subject expert
• Building and maintaining these rules is expensive
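A toy hand-coded rule of this flavor (entirely illustrative):

    def spam_rule(text):
        # fire if (prize-related AND action-related) words occur, unless
        # the message also mentions a meeting
        words = set(text.lower().split())
        return bool({"winner", "prize"} & words
                    and {"click", "claim"} & words
                    and "meeting" not in words)

    print(spam_rule("Click here to claim your prize"))   # True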
Ch. 15
Additional Papers and Sources
• Christopher J. C. Burges. 1998. A Tutorial on Support Vector Machines for Pattern Recognition.
• S. T. Dumais. 1998. Using SVMs for text categorization. IEEE Intelligent Systems, 13(4).
• S. T. Dumais, J. Platt, D. Heckerman and M. Sahami. 1998. Inductive learning algorithms and representations for text categorization. CIKM '98, pp. 148-155.
• Yiming Yang, Xin Liu. 1999. A re-examination of text categorization methods. 22nd Annual International SIGIR.
• Tong Zhang, Frank J. Oles. 2001. Text categorization based on regularized linear classification methods. Information Retrieval 4(1): 5-31.
• Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.
• T. Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002.
• Fan Li, Yiming Yang. 2003. A loss function analysis for classification methods in text categorization. ICML 2003: 472-479.
• Tie-Yan Liu, Yiming Yang, Hao Wan, et al. 2005. Support vector machines classification with very large scale taxonomy. SIGKDD Explorations, 7(1): 36-43.