Optimizing Text Classification
Mark Trenorden
Supervisor: Geoff Webb
Introduction
What is Text Classification?
Naïve Bayes
Event Models
Binomial Model
Binning
Conclusion
Text Classification
Grouping documents by topic
For example, Sport, Politics, etc.
A slow process for humans
Naïve Bayes
P(cj | di) = P(cj) P(di | cj) / P(di)
This is Bayes theorem
Naïve Bayes assumes independence between attributes, in this case words.
This is not a correct assumption, yet classification still performs well.
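As a rough illustration (not from the talk), a minimal Python sketch of how Bayes theorem plus the independence assumption turns into a scoring rule; the priors and word probabilities here are made-up values.

# Minimal Naive Bayes scoring sketch (illustrative numbers only).
# P(c | d) is proportional to P(c) * product over words w in d of P(w | c).

priors = {"Sport": 0.5, "Politics": 0.5}          # P(c), assumed values
word_probs = {                                     # P(w | c), assumed values
    "Sport":    {"goal": 0.08, "vote": 0.01},
    "Politics": {"goal": 0.01, "vote": 0.07},
}

def score(document_words, cls):
    """Unnormalised P(c | d) under the independence assumption."""
    p = priors[cls]
    for w in document_words:
        p *= word_probs[cls].get(w, 1e-6)          # tiny floor for unseen words
    return p

doc = ["goal", "goal", "vote"]
scores = {c: score(doc, c) for c in priors}
print(max(scores, key=scores.get), scores)         # predicted class and scores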
Event Models
Different ways of viewing a document
In Bayes rule this translates to different ways of calculating P(di | cj).
There are two frequently used models:
Multi-Variate Bernoulli Model
In text classification terms,
– A document (di) is an EVENT
– Words (wt) within the document are considered ATTRIBUTES of di
– The number of occurrences of a word in a document is not recorded
– When calculating the probability of class membership, all words in the vocabulary are considered, even if they don't appear in the document
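A minimal sketch of the Multi-Variate Bernoulli likelihood just described, with an invented vocabulary and assumed word probabilities; note that every vocabulary word contributes a factor, whether present or absent.

# Multi-variate Bernoulli likelihood sketch (illustrative values).
# Every word in the vocabulary contributes a factor, present or not.

vocabulary = ["george", "bush", "cat", "goal"]
p_word_given_class = {"george": 0.4, "bush": 0.5, "cat": 0.1, "goal": 0.05}  # P(wt | cj), assumed

def bernoulli_likelihood(document_words):
    present = set(document_words)                  # occurrence counts are ignored
    likelihood = 1.0
    for w in vocabulary:
        p = p_word_given_class[w]
        likelihood *= p if w in present else (1.0 - p)
    return likelihood

print(bernoulli_likelihood(["bush", "bush", "george"]))  # repeated 'bush' counts once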
Multinomial Model
The number of occurrences of a word is captured
Individual word occurrences are considered as "events"
The document is considered to be a collection of events
Only words that appear in the document, and their counts, are considered when calculating class membership
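A matching sketch of the multinomial likelihood, again with assumed probabilities; only words that occur contribute, weighted by their counts (the multinomial coefficient is dropped, since it is constant for a given document).

# Multinomial likelihood sketch (illustrative values).
# Only words that occur contribute, and their counts matter.

from collections import Counter

p_word_given_class = {"george": 0.4, "bush": 0.5, "cat": 0.1}    # P(wt | cj), assumed

def multinomial_likelihood(document_words):
    likelihood = 1.0
    for word, count in Counter(document_words).items():          # absent words are skipped
        likelihood *= p_word_given_class.get(word, 1e-6) ** count
    return likelihood

print(multinomial_likelihood(["bush", "bush", "george"]))        # 0.5**2 * 0.4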
Previous Comparison
The Multi-Variate model is good for a small vocabulary
The Multi-Nomial model is good for a large vocabulary
The Multi-Nomial is much faster than the Multi-Variate
Binomial Model
Want to capture occurrences and non-occurrences as well as word frequencies.
P(di | cj) = Π over all words wt of P(wt | cj)^N × P(~wt | cj)^(L−N)
Where cj = class, wt = word, di = document, L = document length and N = number of occurrences of wt in di
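A sketch of this binomial likelihood, assuming the per-word factors are multiplied over the vocabulary and the binomial coefficient is dropped as constant per document; the probabilities are illustrative, not from the talk.

# Binomial model likelihood sketch (illustrative values, binomial coefficient omitted).
# Each vocabulary word contributes P(w | c)^N * (1 - P(w | c))^(L - N),
# where L is the document length and N the word's occurrence count.

from collections import Counter

p_word_given_class = {"george": 0.4, "bush": 0.5, "cat": 0.1}    # P(wt | cj), assumed

def binomial_likelihood(document_words):
    counts = Counter(document_words)
    L = len(document_words)
    likelihood = 1.0
    for word, p in p_word_given_class.items():
        N = counts.get(word, 0)
        likelihood *= (p ** N) * ((1.0 - p) ** (L - N))
    return likelihood

print(binomial_likelihood(["bush", "bush", "george"]))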
Binomial Results
Performed just as well as the Multi-Nomial with a large vocabulary, however it was much slower.
Outperformed the Multi-Variate once the vocabulary size increased
However it did worse than the existing techniques at smaller vocabulary sizes
Binomial Results
[Chart: % correctly classified vs. number of words in the vocabulary (100, 1000, 5000) for the Multi-Variate, Binomial and Multi-Nomial models]
Document Length
None of the techniques take into account document length. Currently,
P(d | c) = f(w ∈ d, c)
However we should incorporate document length:
P(d | c) = f(w ∈ d, l, c)
Binning
Discretization has been found to be effective for numeric variables in Naïve Bayes.
Binning groups documents of similar lengths together
The theory is that the word distributions will differ significantly for different document lengths
This will help improve classification
Binning
For my tests, bin size = 1000; if there are fewer than 2000 documents, only two bins are used.
[Diagram: Bin 1 | Bin 2 | Bin 3 | Bin 4, ordered by increasing document size]
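A sketch of this binning rule, assuming "bin size = 1000" means 1000 documents per bin with documents sorted by length, and that the two-bin fallback simply splits the collection in half; both details are assumptions about the setup.

# Length-binning sketch. Assumes "bin size = 1000" means 1000 documents per bin
# (documents sorted by length) and that collections with fewer than 2000
# documents are simply split into two halves -- both details are assumptions.

def bin_documents(documents, bin_size=1000):
    """Return a list of bins, each a list of documents of similar length."""
    by_length = sorted(documents, key=len)
    if len(by_length) < 2 * bin_size:
        half = len(by_length) // 2
        return [by_length[:half], by_length[half:]]        # two-bin fallback
    return [by_length[i:i + bin_size] for i in range(0, len(by_length), bin_size)]

# Each bin then gets its own word-count tables per class, as in the example that follows.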
Binning Example
Two bins are created. Tables with word counts for each class within a bin are created, as opposed to one table for all words as in traditional methods.
Length 0–10 words:
            George   Bush   Cat
GWB          4/20    7/20   3/20
Not GWB      3/20    1/20   2/20

Length 11–20 words:
            George   Bush   Cat
GWB          2/20    3/20   3/20
Not GWB      3/20    2/20   7/20
Binning
Given an unseen document, binning helps refine probabilities. For example:
If there are no bins, the probability that the word 'Bush' occurs in the GWB class is 10/40, or 25%.
If we know that the document is in the 0–10 words bin, the probability of the word 'Bush' appearing in the GWB class is 7/20, or 35%.
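A quick check of this arithmetic, using the counts taken straight from the example tables above.

# Per-bin word counts for the GWB class, taken from the example tables.
# Each bin contains 20 words of GWB-class text in this example.
gwb_counts = {
    "0-10 words":  {"George": 4, "Bush": 7, "Cat": 3},
    "11-20 words": {"George": 2, "Bush": 3, "Cat": 3},
}
bin_totals = {"0-10 words": 20, "11-20 words": 20}

# Without bins: pool the counts across bins.
pooled = sum(counts["Bush"] for counts in gwb_counts.values()) / sum(bin_totals.values())
print(pooled)                                              # 10/40 = 0.25

# With bins: condition on the document's length bin.
binned = gwb_counts["0-10 words"]["Bush"] / bin_totals["0-10 words"]
print(binned)                                              # 7/20 = 0.35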
Binning Results
When applied across all datasets, binning improved classification accuracy for all techniques
Binning Results
7 Sectors Dataset, Multi-Variate Method
[Chart: % correct vs. vocabulary size (100, 1000, 5000), with and without bins]
Binning Results
WebKB Dataset, Multi-Nomial Method
[Chart: % correct vs. vocabulary size (100, 1000, 5000), with and without bins]
Conclusion/Future Goals
Binning was the best solution
It is applicable to all event models
In future, apply the event models and binning techniques to classification techniques other than Naïve Bayes.