SIMS 290-2: Applied Natural Language Processing


SIMS 290-2:
Applied Natural Language Processing
Marti Hearst
October 30, 2006
Some slides by Preslav Nakov and Eibe Frank
1
Today
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
2
20 Newsgroups Data Set
http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
Source: originally collected by Ken Lang
Content and structure:
approximately 20,000 newsgroup documents
– 19,997 originally
– 18,828 without duplicates
partitioned evenly across 20 different newsgroups
we are only using a subset (6 newsgroups)
Some categories are strongly related (and thus hard to discriminate):
computers
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
misc.forsale
talk.politics.misc
talk.politics.guns
talk.politics.mideast
talk.religion.misc
alt.atheism
soc.religion.christian
3
Sample Posting: “talk.politics.guns”
From: [email protected] (C. D. Tavares)            ← "from" header
Subject: Re: Congress to review ATF's status     ← "subject" header

In article <[email protected]>, [email protected] (Larry Cipriani) writes:

> WASHINGTON (UPI) -- As part of its investigation of the deadly
> confrontation with a Texas cult, Congress will consider whether the
> Bureau of Alcohol, Tobacco and Firearms should be moved from the
> Treasury Department to the Justice Department, senators said Wednesday.
>
> The idea will be considered because of the violent and fatal events
> at the beginning and end of the agency's confrontation with the Branch
> Davidian cult.

← reply:
Of course. When the catbox begines to smell, simply transfer its
contents into the potted plant in the foyer.

"Why Hillary! Your government smells so... FRESH!"
--
[email protected] OR [email protected]
--If you believe that I speak for my company, write today for my special Investors' Packet...
← signature: needs special handling during feature extraction…
4
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
5
WEKA: The Bird
Copyright: Martin Kramer ([email protected]), University of Waikato, New Zealand
Slide adapted from Eibe Frank's
6
WEKA: the software
Waikato Environment for Knowledge Analysis
Collection of state-of-the-art machine learning algorithms and data
processing tools implemented in Java
Released under the GPL
Support for the whole process of experimental data mining
Preparation of input data
Statistical evaluation of learning schemes
Visualization of input data and the result of learning
Used for education, research and applications
Complements “Data Mining” by Witten & Frank
Slide by Eibe Frank
7
Main Features
49 data preprocessing tools
76 classification/regression algorithms
8 clustering algorithms
15 attribute/subset evaluators + 10 search algorithms for feature selection
3 algorithms for finding association rules
3 graphical user interfaces
“The Explorer” (exploratory data analysis)
“The Experimenter” (experimental environment)
“The KnowledgeFlow” (new process-model-inspired interface)
Slide by Eibe Frank
8
Projects based on WEKA
Incorporate/wrap WEKA
GRB Tool Shed - a tool to aid gamma ray burst research
YALE - facility for large scale ML experiments
GATE - NLP workbench with a WEKA interface
Judge - document clustering and classification
Extend/modify WEKA
BioWeka - extension library for knowledge discovery in biology
WekaMetal - meta learning extension to WEKA
Weka-Parallel - parallel processing for WEKA
Grid Weka - grid computing using WEKA
Weka-CG - computational genetics tool library
Slide by Eibe Frank
9
The WEKA Project Today (2006)
Funding for the next two years
Goal of the project remains the same
People
6 staff
2 postdocs
3 PhD students
3 MSc students
2 research programmers
Slide by Eibe Frank
10
WEKA: The Software Toolkit
http://www.cs.waikato.ac.nz/ml/weka
http://sourceforge.net/projects/weka/
Machine learning/data mining software in Java
GNU License
Used for research, education and applications
Complements “Data Mining” by Witten & Frank
Main features:
data pre-processing tools
learning algorithms
evaluation methods
graphical interface (incl. data visualization)
environment for comparing learning algorithms
Slide adapted from Eibe Frank's
11
WEKA: Terminology
Some synonyms/explanations for the terms used by WEKA, which may differ from what we use:
Attribute: feature
Relation: collection of examples
Instance: a single example (one data point)
Class: category
12
WEKA GUI Chooser
Slide adapted from Eibe Frank's
java -Xmx1000M -jar weka.jar
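(The -Xmx1000M flag raises the Java heap limit to roughly 1 GB, which helps WEKA handle larger text datasets.)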
13
Our Toy Example
We demonstrate WEKA on a simple example:
3 categories from “Newsgroups”:
– misc.forsale,
– rec.sport.hockey,
– comp.graphics
20 documents per category
features:
– words converted to lowercase
– frequency 2 or more required
– stopwords removed
Slide adapted from Eibe Frank's
14
Explorer: Pre-Processing The Data
WEKA can import data from:
files: ARFF, CSV, C4.5, binary
URL
SQL database (using JDBC)
Pre-processing tools (filters) are used for:
Discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.
Slide adapted from Eibe Frank's
15
The Preprocessing Tab
[Screenshot of the Explorer's Preprocessing tab. Callouts: Preprocessing and Classification tabs; filter selection; statistical attribute selection; manual attribute selection; list of attributes (last: the class variable); statistics about the values of the selected attribute; frequency and categories for the selected attribute]
16
Explorer: Building “Classifiers”
Classifiers in WEKA are models for:
classification (predict a nominal class)
regression (predict a numerical quantity)
Learning algorithms:
Naïve Bayes, decision trees, kNN, support
vector machines, multi-layer perceptron,
logistic regression, etc.
Meta-classifiers:
cannot be used alone
always combined with a learning algorithm
examples: boosting, bagging etc.
Slide adapted from Eibe Frank's
17
The Classification Tab
[Screenshot of the Classification tab. Callouts: choice of classifier; cross-validation (split the data into e.g. 10 folds, then 10 times train on 9 folds and test on the remaining one); the class attribute, whose value is to be predicted from the values of the remaining ones (default is the last attribute; in our toy example it is named "class")]
18
Choosing a classifier
19
20
[Screenshot of the NaiveBayes options dialog. Callouts: displays synopsis and options; False: Gaussian, True: kernels (better); outputs additional information; numerical-to-nominal conversion by discretization]
21
22
23
[Screenshot of the classifier output. Callouts: accuracy; a different/easy class; all other numbers can be obtained from it]
24
Confusion matrix
Contains information about the actual and the predicted classification
All measures can be derived from it:

             predicted
               –    +
  true   –    a    b
         +    c    d

accuracy: (a+d)/(a+b+c+d)
recall: d/(c+d) => R
precision: d/(b+d) => P
F-measure: 2PR/(P+R)
false positive (FP) rate: b/(a+b)
true negative (TN) rate: a/(a+b)
false negative (FN) rate: c/(c+d)
These extend for more than 2 classes:
see previous lecture slides for details
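As a concrete illustration (not from the original slides), here is a minimal Python sketch that derives these measures from the four cells of a two-class confusion matrix:

    # a = true -, predicted -    b = true -, predicted +
    # c = true +, predicted -    d = true +, predicted +
    def measures(a, b, c, d):
        accuracy  = (a + d) / (a + b + c + d)
        recall    = d / (c + d)                    # R
        precision = d / (b + d)                    # P
        f_measure = 2 * precision * recall / (precision + recall)
        return accuracy, precision, recall, f_measure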
25
Predictions Output
Outputs the probability distribution for each example
26
Predictions Output
Probability distribution for a wrong example: predicted class 1 instead of 3.
Naïve Bayes makes incorrect conditional independence assumptions and is typically over-confident in its predictions, whether correct or not.
27
Error Visualization
28
Error Visualization
Little squares designate errors.
Axes show example number.
29
Running on Test Set
30
Explorer: Attribute Selection
Find which attributes are the most predictive ones
Two parts:
search method:
– best-first, forward selection, random, exhaustive, genetic algorithm, ranking
evaluation method:
– information gain, chi-squared, etc.
Very flexible: WEKA allows (almost) arbitrary combinations of these two
Slide adapted from Eibe Frank's
31
Individual Features Ranking
32
Individual Features Ranking
misc.forsale
rec.sport.hockey
comp.graphics
33
Individual Features Ranking
misc.forsale
rec.sport.hockey
comp.graphics
[Callouts: random number seed; ???]
34
Saving the Selected Features
All we can do from this tab is to save the buffer in a text file. Not very useful...
But we can also perform feature selection during the pre-processing step... (see the following slides)
40
Features Selection on Preprocessing
41
Features Selection on Preprocessing
42
Features Selection on Preprocessing
679 attributes: 678 + 1 (for the class)
43
Features Selection on Preprocessing
Just 22 attributes remain: 21 + 1 (for the class)
44
Run Naïve Bayes With the 21 Features
[Screenshot: classifier output showing higher accuracy with the 21 attributes]
45
(AGAIN) Naïve Bayes With All Features
[Screenshot (repeated slide): accuracy and the different/easy class, with ALL 679 attributes]
46
Some Important Algorithms
WEKA has weird naming for some algorithms
Here are some translations:
Naïve Bayes: weka.classifiers.bayes.NaiveBayes
Perceptron: weka.classifiers.functions.VotedPerceptron
Decision tree: weka.classifiers.trees.J48
Support vector machines: weka.classifiers.functions.SMO
k nearest neighbor: weka.classifiers.lazy.IBk
Some of these are more sophisticated versions of the classic algorithms
e.g. the classic Naïve Bayes seems to be missing
A good alternative is the Multinomial Naïve Bayes model
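For instance, assuming weka.jar is on the classpath, a classifier such as Multinomial Naïve Bayes can also be run from the command line (a minimal sketch; -t names the training file, -T the test file):

    java -cp weka.jar weka.classifiers.bayes.NaiveBayesMultinomial -t train.arff -T test.arff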
47
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
48
Performing Experiments
Experimenter makes it easy to compare the performance of different learning schemes
Problems:
classification
regression
Results: written into file or database
Evaluation options:
cross-validation
learning curve
hold-out
Can also iterate over different parameter settings
Significance-testing built in!
Slide adapted from Eibe Frank's
49
Experiments Setup
50
Experiments Setup
51
Experiments Setup
[Screenshot: experiment setup. Callouts: datasets; results destination, a CSV file that can be opened in Excel; algorithms]
52
Experiments Setup
53
Experiments Setup
54
Experiments Setup
55
Experiments Setup
56
Experiments Setup
[Screenshot: results analysis. Callouts: accuracy; SVM is the best; SVM is statistically better than Naïve Bayes; decision tree is statistically worse than Naïve Bayes; decision tree is the worst]
57
Experiments: Excel
Results are output into a CSV file, which can be read in Excel!
58
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
59
WEKA File Format: ARFF
@relation heart-disease-simplified

@attribute age numeric                  ← numerical attribute
@attribute sex { female, male }         ← nominal attribute
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes }
@attribute class { present, not_present }

(Other attribute types: String, Date)

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present   ← "?" marks a missing value
...

Slide adapted from Eibe Frank's
60
WEKA File Format: Sparse ARFF
Value 0 is not represented explicitly
Same header (i.e. the @relation and @attribute declarations);
the @data section is different
Instead of
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
We have
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
This is especially useful for textual data (why?)
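As a small illustrative sketch (not WEKA code), the dense-to-sparse conversion above could be written in Python like this, with 0-based attribute indices:

    # Render one dense row as a sparse ARFF data line: keep only non-zero cells.
    def to_sparse(row):
        cells = ['%d %s' % (i, v) for i, v in enumerate(row) if v != 0]
        return '{' + ', '.join(cells) + '}'

    # to_sparse([0, 'X', 0, 'Y', '"class A"'])  ->  '{1 X, 3 Y, 4 "class A"}'

(Sparse format pays off for text because each document contains only a tiny fraction of the vocabulary, so almost all counts are 0.)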
61
Python Interface to WEKA
This is just to get you started
Assumes the newsgroups collection
Extracts simple features
currently just single word features
– Uses a simple tokenizer which removes punctuation
uses a stoplist
lowercases the words
Includes filtering code
currently eliminates numbers
Features are weighted by frequency within document
Produces a sparse ARFF file to be used by WEKA
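To make this concrete, below is a minimal Python sketch of this kind of feature extraction (illustrative only; the names and the stopword list are placeholders, not the actual assignment code):

    import re
    from collections import Counter

    STOPWORDS = {'the', 'a', 'an', 'of', 'to', 'and', 'in', 'is', 'it'}  # tiny placeholder list

    def extract_features(text):
        # lowercase, strip punctuation, drop stopwords and pure numbers
        words = re.findall(r'[a-z0-9]+', text.lower())
        words = [w for w in words if w not in STOPWORDS and not w.isdigit()]
        return Counter(words)  # features weighted by frequency within the document

The resulting per-document counts can then be written out as sparse ARFF data lines like the ones shown earlier.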
62
Python Interface to WEKA
Allows you to specify:
which directory to read files from
which newsgroups to use
the number of documents for training each newsgroup
the number of features to retain
63
Python Interface to WEKA
Things to (optionally) add or change:
an option to not use stopwords
an option to retain capitalization
regular expression pattern a feature should match
other non-word-based features
morphological normalization
a minimum threshold for the number of times a term occurs before it can be counted as a feature
tf.idf weighting on terms
your idea goes here
64
Python Interface to WEKA
TF.IDF: tij × log(N/ni)
TF: tij
– tij: frequency of term i in document j
– this is how features are currently weighted
IDF: log(N/ni)
– ni: number of documents containing term i
– N: total number of documents
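A one-line Python version of this weighting (a sketch, not the distributed script):

    import math

    def tfidf(tf_ij, n_i, N):
        # term frequency times log inverse document frequency
        return tf_ij * math.log(N / n_i)

    # e.g. a term occurring 3 times in a document and appearing in
    # 5 of 60 documents gets weight 3 * log(12) ≈ 7.45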
65
Python Weka Code
[Slides 66-73: screenshots of the Python WEKA interface code]
ARFF file
[Slides 74-75: excerpts of the generated ARFF file]
Assignment
Due November 13.
Work individually on this one.
Objective is to use the training set to get the best features and learning model you can.
FREEZE this.
Then run one time only on the test set.
This is a realistic way to see how well your algorithm does on unseen data.
76
Next Time
Machine learning algorithms
77