
Optimizing Text Classification
Mark Trenorden
Supervisor: Geoff Webb
Introduction

What is Text Classification?

Naïve Bayes

Event Models

Binomial Model

Binning

Conclusion
Text Classification

Grouping documents of the same topic

For example, Sport, Politics, etc.

A slow process for humans
Naïve Bayes



P(cj | di) = P(cj) P(di | cj) / P(di)

This is Bayes' theorem.
Naïve Bayes assumes independence between attributes, in this case words.

This is not a correct assumption, but classification still performs well.
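Below is a minimal Python sketch of this, assuming the class priors P(c) and per-class word probabilities P(w | c) have already been estimated from training data; the class names, words, and probability values are made up purely for illustration.

```python
import math

# Hypothetical pre-estimated model: class priors P(c) and per-class
# word probabilities P(w | c); in practice these come from training counts.
priors = {"Sport": 0.5, "Politics": 0.5}
word_probs = {
    "Sport":    {"goal": 0.08, "match": 0.06, "vote": 0.01},
    "Politics": {"goal": 0.01, "match": 0.02, "vote": 0.07},
}

def classify(words):
    """Pick the class maximising log P(c) + sum over words of log P(w | c),
    i.e. Bayes' theorem with the naive independence assumption."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in words:
            # Tiny default probability stands in for smoothing of unseen words.
            score += math.log(word_probs[c].get(w, 1e-6))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["goal", "match"]))  # -> Sport
print(classify(["vote"]))           # -> Politics
```

Because P(d) is the same for every class, the sketch simply compares log P(c) plus the summed log word probabilities across classes.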
Event Models

Different ways of viewing a document

In Bayes' rule this translates to different ways of calculating P(di | cj).

There are two frequently used models.
Multi-Variate Bernoulli Model

In text classification terms:

– A document (di) is an EVENT

– Words (wt) within the document are considered as ATTRIBUTES of di

– The number of occurrences of a word in a document is not recorded

– When calculating the probability of class membership, all words in the vocabulary are considered, even if they don’t appear in the document (see the sketch below)
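A minimal sketch of this likelihood, with a made-up vocabulary and presence probabilities for a single class; note that every vocabulary word contributes whether or not it appears, and repeated occurrences change nothing.

```python
import math

# Hypothetical estimates of P(word appears in a document | class)
# for one class, with one entry per vocabulary word.
vocab_probs = {"george": 0.4, "bush": 0.5, "cat": 0.2}

def log_p_doc_given_class_bernoulli(doc_words, vocab_probs):
    """Multi-Variate Bernoulli event model: the document is a single event and
    each vocabulary word is a binary attribute. Every vocabulary word
    contributes, present or not, and occurrence counts are ignored."""
    present = set(doc_words)  # only presence/absence matters
    log_p = 0.0
    for w, p in vocab_probs.items():
        log_p += math.log(p) if w in present else math.log(1.0 - p)
    return log_p

# 'bush' appearing twice has the same effect as appearing once.
print(log_p_doc_given_class_bernoulli(["bush", "bush", "george"], vocab_probs))
```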
Multinomial Model




The number of occurrences of a word is captured

Individual word occurrences are considered as “events”

The document is considered to be a collection of events

Only words that appear in the document, and their counts, are considered when calculating class membership (sketched below)
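A matching sketch for the multinomial likelihood, again with made-up word probabilities for one class; only the words that actually occur contribute, weighted by their counts (the multinomial coefficient is the same for every class, so it is dropped).

```python
import math
from collections import Counter

# Hypothetical estimates of P(w | class) for one class under the multinomial
# model: the probability that a single drawn word is w.
word_probs = {"george": 0.3, "bush": 0.5, "cat": 0.2}

def log_p_doc_given_class_multinomial(doc_words, word_probs):
    """Multinomial event model: each word occurrence is an event and the
    document is a collection of such events. Only words that appear
    contribute, weighted by their counts."""
    counts = Counter(doc_words)
    return sum(n * math.log(word_probs[w]) for w, n in counts.items())

# 'bush' counted twice contributes twice to the log-probability.
print(log_p_doc_given_class_multinomial(["bush", "bush", "george"], word_probs))
```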
Previous Comparison

The Multi-Variate model is good for a small vocabulary

The Multinomial model is good for a large vocabulary

The Multinomial model is much faster than the Multi-Variate model
Binomial Model

We want to capture occurrences and non-occurrences as well as word frequencies.

P(di | cj) = Π over words w of P(w | cj)^N * P(~w | cj)^(L − N)

where w = word, L = document length, N = number of occurrences of w in di, and P(~w | cj) = 1 − P(w | cj)
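A sketch of this likelihood in the same style, with illustrative word probabilities; the binomial coefficients are omitted since they are constant across classes, and P(~w | cj) is computed as 1 − P(w | cj).

```python
import math
from collections import Counter

# Hypothetical per-word draw probabilities P(w | class) for one class.
word_probs = {"george": 0.3, "bush": 0.5, "cat": 0.2}

def log_p_doc_given_class_binomial(doc_words, word_probs):
    """Binomial model sketch: each vocabulary word w contributes
    P(w|c)^N * (1 - P(w|c))^(L - N), where L is the document length and N is
    w's count, so both occurrences and non-occurrences are captured."""
    counts = Counter(doc_words)
    length = len(doc_words)
    log_p = 0.0
    for w, p in word_probs.items():
        n = counts.get(w, 0)
        log_p += n * math.log(p) + (length - n) * math.log(1.0 - p)
    return log_p

print(log_p_doc_given_class_binomial(["bush", "bush", "george"], word_probs))
```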
Binomial Results

Performed just as well as the Multinomial model with a large vocabulary, however it was much slower.

Outperformed the Multi-Variate model once the vocabulary increased.

However, it did worse than existing techniques at smaller vocabulary sizes.
Binomial Results
[Chart: % correctly classified vs. number of words in the vocabulary (100, 1000, 5000) for the Multi-Variate, Binomial, and Multinomial models]
Document Length

None of the techniques takes document length into account. Currently,

P(d | c) = f(w ∈ d, c)

However, we should incorporate the document length l:

P(d | c) = f(w ∈ d, l, c)
Binning

Discretization has been found to be effective for numeric variables in Naïve Bayes.

Binning groups documents of similar lengths.

The theory is that the word distributions will differ significantly for different document lengths.

This should help improve classification.
Binning

For my tests, bin size = 1000; if there are fewer than 2000 documents, only two bins are used (sketched below).
[Diagram: Bin 1, Bin 2, Bin 3, Bin 4 laid out along an axis of increasing document size]
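A sketch of one plausible reading of this scheme, where "bin size = 1000" is taken to mean 1000 documents per bin after sorting the training documents by length; the exact splitting rule below is an assumption, and only the two-bin fallback comes from the slide.

```python
def assign_bins(doc_lengths, bin_size=1000):
    """Sketch of the binning scheme: sort documents by length and group them
    into bins of `bin_size` documents each; with fewer than 2 * bin_size
    documents, just split them into two bins (the fallback on the slide)."""
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i])
    if len(order) < 2 * bin_size:
        per_bin = (len(order) + 1) // 2  # two roughly equal bins
    else:
        per_bin = bin_size
    n_bins = -(-len(order) // per_bin)   # ceiling division
    return [order[i * per_bin:(i + 1) * per_bin] for i in range(n_bins)]

# Example: ten documents of varying length fall back to two bins.
lengths = [5, 12, 7, 30, 3, 18, 9, 25, 40, 2]
for b, members in enumerate(assign_bins(lengths)):
    print("Bin", b + 1, "->", sorted(lengths[i] for i in members))
```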
Binning Example

Two bins are created. Tables with word counts for each class within a bin are created, as opposed to one table for all words as in traditional methods.
Length 0-10 words:

          George   Bush   Cat
GWB        4/20    7/20   3/20
Not GWB    3/20    1/20   2/20

Length 11-20 words:

          George   Bush   Cat
GWB        2/20    3/20   3/20
Not GWB    3/20    2/20   7/20
Binning



Given an unseen document, binning helps refine the probabilities. For example:

With no bins, the probability that the word ‘Bush’ occurs in the GWB class is 10/40, or 25%.

If we know that the document is in the 0-10 word bin, the probability of the word ‘Bush’ appearing in the GWB class is 7/20, or 35% (see the sketch below).
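The same arithmetic written out as a small sketch, using the counts from the example tables above.

```python
from fractions import Fraction

# Counts for the word 'Bush' in the GWB class, taken from the example tables.
bush_in_gwb = {"0-10 words": Fraction(7, 20), "11-20 words": Fraction(3, 20)}

# With no bins the two tables are pooled: (7 + 3) / (20 + 20) = 10/40 = 25%.
pooled = Fraction(7 + 3, 20 + 20)
print("No bins:  P(Bush | GWB) =", pooled, "=", float(pooled))

# Knowing the document falls in the 0-10 word bin refines this to 7/20 = 35%.
binned = bush_in_gwb["0-10 words"]
print("0-10 bin: P(Bush | GWB) =", binned, "=", float(binned))
```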
Binning Results

When applied to all datasets, binning improved classification accuracy for all techniques.
Binning Results
7 Sectors Dataset, Multi-Variate Method
[Chart: % correct vs. vocabulary size (100, 1000, 5000), comparing Use Bins vs. No Bins]
Binning Results
WebKB Dataset, Multi-Nomial Method
[Chart: % correct vs. vocabulary size (100, 1000, 5000), comparing Use Bins vs. No Bins]
Conclusion/Future Goals

Binning was the best solution.

It is applicable to all event models.

In future, apply the event models and binning techniques to classification techniques other than Naïve Bayes.