Text Classification: Weka

Download Report

Transcript Text Classification: Weka

SIMS 290-2:
Applied Natural Language Processing
Preslav Nakov
October 6, 2004
1
Today
The 20 Newsgroups Text Collection
WEKA: Exporer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
2
The 20 Newsgroups Text Collection
WEKA: Exporer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
3
20 Newsgroups Data Set
http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
Source: originally collected by Ken Lang
Content and structure:
approximately 20,000 newsgroup documents
– 19,997 originally
– 18,828 without duplicates
partitioned evenly across 20 different newsgroups
Some categories are strongly related (and thus hard to
discriminate):
computers
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
misc.forsale
talk.politics.misc
talk.politics.guns
talk.politics.mideast
talk.religion.misc
alt.atheism
soc.religion.christian
4
Sample Posting: “talk.politics.guns”
from
From: [email protected] (C. D. Tavares)
Subject: Re: Congress to review ATF's status
subject
… writes:
In article <[email protected]>, [email protected] (Larry Cipriani) writes:
>
WASHINGTON (UPI) -- As part of its investigation of the deadly
> confrontation with a Texas cult, Congress will consider whether the
> Bureau of Alcohol, Tobacco and Firearms should be moved from the
> Treasury Department to the Justice Department, senators said Wednesday.
>
The idea will be considered because of the violent and fatal events
> at the beginning and end of the agency's confrontation with the Branch
> Davidian cult.
reply
Of course. When the catbox begines to smell, simply transfer its
contents into the potted plant in the foyer.
"Why Hillary! Your government smells so... FRESH!"
--
Need special
handling during
feature
extraction…
signature
[email protected] --If you believe that I speak for my company,
OR [email protected]
write today for my special Investors' Packet...
5
The 20 Newsgroups Text Collection
WEKA: Exporer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
6
WEKA: The Bird
Copyright: Martin Kramer ([email protected]), University of Waikato, New Zealand
Slide adapted from Eibe Frank's
7
WEKA: Terminology
Some synonyms/explanations for the terms used by
WEKA, which may differ from what we adopted:
Attribute: feature
Relation: collection of examples
Instance: collection in use
Class: category
8
WEKA: The Software Toolkit
http://www.cs.waikato.ac.nz/ml/weka
Machine learning/data mining software in Java
GNU License
Used for research, education and applications
Complements “Data Mining” by Witten & Frank
Main features:
data pre-processing tools
learning algorithms
evaluation methods
graphical interface (incl. data visualization)
environment for comparing learning algorithms
Slide adapted from Eibe Frank's
9
WEKA GUI Chooser
Slide adapted from Eibe Frank's
java -Xmx1000M -jar weka.jar
10
Our Toy Example
We demonstrate WEKA on a toy example:
3 categories from “20 Newsgroups”:
– misc.forsale,
– rec.sport.hockey,
– comp.graphics
20 documents per category
features:
– words converted to lowercase
– frequency 2 or more required
– stopwords removed
Slide adapted from Eibe Frank's
11
Explorer: Pre-Processing The Data
WEKA can import data is from:
files: ARFF, CSV, C4.5, binary
URL
SQL database (using JDBC)
Pre-processing tools (filters) are used for:
Discretization, normalization, resampling,
attribute selection, transforming and
combining attributes, etc.
Slide adapted from Eibe Frank's
12
The Preprocessing Tab
Classification
Preprocessing
Statistical
attribute
selection
Filter selection
Manual attribute
selection
List of attributes
(last: class variable)
Statistics about
the values of the
selected attribute
Frequency and
categories for
the selected
attribute
13
Explorer: Building “Classifiers”
Classifiers in WEKA are models for:
classification (predict a nominal class)
regression (predict a numerical quantity)
Learning algorithms:
Naïve Bayes, decision trees, kNN, support
vector machines, multi-layer perceptron,
logistic regression, etc.
Meta-classifiers:
cannot be used alone
always combined with a learning algorithm
examples: boosting, bagging etc.
Slide adapted from Eibe Frank's
14
The Classification Tab
Choice of
classifier
Cross-validation: split the
data into e.g. 10 folds and
10 times train on 9 folds and
test on the remaining one
The attribute
whose value is to
be predicted from
the values of the
remaining ones.
Default is the last
attribute.
Here (in our toy
example) it is
named “class”.
15
Choosing a
classifier
16
17
displays synopsis
and options
False: Gaussian
True: kernels (better)
outputs additional
numerical to
information
nominal
conversion by
discretization
18
19
20
accuracy
different/easy class
all other
numbers can be
obtained from it
21
Confusion matrix
Contains information about the actual and the
predicted classification
All measures can be derived from it:
predicted
accuracy: (a+d)/(a+b+c+d)
recall: d/(c+d) => R
precision: d/(b+d) => P
F-measure: 2PR/(P+R)
false positive (FP) rate: b/(a+b)
true negative (TN) rate: a/(a+b)
false negative (FN) rate: c/(c+d)
true
–
+
–
a
b
+
c
d
These extend for more than 2 classes:
see previous lecture slides for details
22
Predictions Output
Outputs the
probability
distribution for
each example
23
Predictions Output
Probability
distribution
for a wrong
example:
predicted 1
instead of 3
Naïve Bayes
makes incorrect
conditional
independence
assumptions
and typically is
over-confident
in its prediction
regardless of
whether it is
correct or not.
24
Error Visualization
25
Error Visualization
Little squares
designate errors
Axes show
example number
26
Explorer: Attribute Selection
Find which attributes are the most predictive ones
Two parts:
search method:
– best-first, forward selection, random, exhaustive,
genetic algorithm, ranking
evaluation method:
– information gain, chi-squared, etc.
Very flexible: WEKA allows (almost) arbitrary
combinations of these two
Slide adapted from Eibe Frank's
27
Individual Features Ranking
28
Individual Features Ranking
misc.forsale
rec.sport.hockey
comp.graphics
29
Individual Features Ranking
misc.forsale
rec.sport.hockey
random
number
seed
comp.graphics
???
30
Feature Interactions
category
C
importance of
feature A
feature
A
importance of
feature B
feature correlation
B
feature
2-Way Interactions
Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's
31
Feature Interactions
category
C
importance of
feature A
feature
importance of
feature B
A
B
feature
3-Way Interaction:
What is common to A, B and C together;
and cannot be inferred from pairs of features.
Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's
32
Feature Subsets Selection
Problem illustration
Full set
Empty set
Enumeration
Search
Exhaustive/Complete (enumeration/branch&bounding)
Heuristic (sequential forward/backward)
Stochastic (generate/evaluate)
Individual features or subsets generation/evaluation
Slide adapted from Guozhu Dong's
33
Features Subsets Selection
34
Features Subsets Selection
misc.forsale
rec.sport.hockey
comp.graphics
17,309 subsets considered
21 attributes selected
35
Saving the Selected Features
All we can do from this
tab is to save the
buffer in a text file. Not
very useful...
But we can also
perform feature
selection during the
pre-processing step...
(the following slides)
36
Features Selection on Preprocessing
37
Features Selection on Preprocessing
38
Features Selection on Preprocessing
679 attributes:
678 + 1 (for the class)
39
Features Selection on Preprocessing
Just 22 attributes
remain:
21 + 1 (for the class)
40
Run Naïve Bayes With the 21 Features
higher
accuracy
21 Attributes
41
(AGAIN) Naïve Bayes With All Features
accuracy
different/easy class
ALL 679 Attributes
(repeated slide)
42
Some Important Algorithms
Sometimes WEKA has a weird naming for some algorithms
Here is how to find the algorithms Barbara introduced:
Naïve Bayes: weka.classifiers.bayes.NaiveBayes
Perceptron: weka.classifiers.functions.VotedPerceptron
Winnow: weka.classifiers.functions.winnow
Decision tree: weka.classifiers.trees.J48
Support vector machines: weka.classifiers.functions.SMO
k nearest neighbor: weka.classifiers.lazy.IBk
Some of these are more sophisticated versions of the classic
algorithms
e.g. I cannot find the classic Naïve Bayes in WEKA
(although there are 5 available implementations).
43
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
44
Performing Experiments
Experimenter makes it easy to compare the performance of
different learning schemes
Problems:
classification
regression
Results: written into file or database
Evaluation options:
cross-validation
learning curve
hold-out
Can also iterate over different parameter settings
Significance-testing built in!
Slide adapted from Eibe Frank's
45
Experiments Setup
46
Experiments Setup
47
Experiments Setup
datasets
CSV file: can be open
in Excel
algorithms
48
Experiments Setup
49
Experiments Setup
50
Experiments Setup
51
Experiments Setup
52
Experiments Setup
accuracy
SVM is
the best
SVM is
statistically
better than
Naïve Bayes
Decision tree is
statistically
worse than
Naïve Bayes
Decision
tree is the
worst
53
Experiments: Excel
Results are output into
an CSV file, which can
be read in Excel!
54
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
55
WEKA File Format: ARFF
@relation heart-disease-simplified
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
Numerical attribute
age numeric
Nominal attribute
sex { female, male}
chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
cholesterol numeric
exercise_induced_angina { no, yes}
Other attribute types:
class { present, not_present}
• String
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
Slide adapted from Eibe Frank's
• Date
Missing value
56
WEKA File Format: Sparse ARFF
Value 0 is not represented explicitly
Same header (i.e @relation and @attribute tags)
the @data section is different
Instead of
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
We have
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
This is especially useful for textual data (why?)
But! Problems with feature selection: cannot save results
57
Python Interface to WEKA
Works on the 20 newsgroups collection
Extracts the features
currently words
easy to modify, just change one or more of:
– extract_features_and_freqs()
– is_feature_good()
– build_stoplist()
Allows to filter out:
the stopwords
the infrequent features
Features are weighted by document frequency
Produces an ARFF file to be used by WEKA
58
Python Interface to WEKA
Allows to specify:
which subset of classes to consider
the number of documents for each class
the minimum feature frequency
regular expression pattern a feature should match
whether to remove the stopwords
whether to convert words to lowercase
kind of output to produce:
sparse (i.e., feature = value)
full vector (list of values)
59
Python Interface to WEKA: How To
Needs installed "20_newsgroups“ and "stopwords“
To get the things working under Windows:
open “__init__.py”
in the code below, substitute “/” with “\\”
###################################################
## 20 Newsgroups
groups = [(ng, ng+'/.*') for ng in '''
alt.atheism rec.autos sci.space comp.graphics rec.motorcycles
soc.religion.christian comp.os.ms-windows.misc rec.sport.baseball
talk.politics.guns comp.sys.ibm.pc.hardware rec.sport.hockey
talk.politics.mideast comp.sys.mac.hardware sci.crypt
talk.politics.misc comp.windows.x sci.electronics
talk.religion.misc misc.forsale sci.med'''.split()]
twenty_newsgroups = SimpleCorpusReader(
'20_newsgroups', '20_newsgroups/', '.*/.*', groups,
description_file='../20_newsgroups.readme')
del groups # delete temporary variable
60
Python Interface to WEKA
The Main Function
61
Python Interface to WEKA
Example Usage
Python dictionary
Estimated over the whole set!
cross-validation: OK; test/train: not OK
Use 1
62
Python Interface to WEKA
Functions You Will Probably Want To Modify
Also: stemming!
Also: word+POS!
convert to lowercase
Also: compounds!
63
Python Interface to WEKA
You might want to add… Stemming
Porter stemmer
>>> cats = Token(TEXT='cats', POS='NN')
>>> from nltk.stemmer.porter import *
>>> porter = PorterStemmer()
>>> porter.stem(cats)
>>> print cats
<POS='NN', STEM='cat', TEXT='cats'>
WordNet stemmer
morphy – morphological analyzer
you need the following packages installed:
– nltk.wordnet
– nltk-contrib.pywordnet
>>> from nltk_contrib.pywordnet.stemmer import *
>>> morphy('dogs')
'dog'
64
Python Interface to WEKA
You might want to add… TF.IDF
TF.IDF: tij  log(N/ni)
TF
– tij: frequency of term i in document j
– this is how features are currently weighted
IDF: log(N/ni)
– ni: number of documents containing term i
– N: total number of documents
Modify the function extract_features_and_freqs_forall()
65
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
66
Summary
The 20 Newsgroups Text Collection
WEKA: The Toolkit
Explorer
– Classification
– Feature selection
Experimenter
ARFF file format
Python Interface to WEKA
feature extraction
stemming
Weighting: TF.IDF
WEKA: Real-time Demo
67