CES 514
Lec 11
April 28, 2010
Neural networks, case study of naïve Bayes and decision trees, text classification
Artificial Neural Networks (ANN)
[Figure: inputs X1, X2, X3 feed a black box that produces output Y.]

X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

Output Y is 1 if at least two of the three inputs are equal to 1.
Neural Network with one neuron
Rosenblatt 1958 (perceptron), also known as a threshold logic unit.

X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

[Figure: input nodes X1, X2, X3 connect to a single output node Y through links of weight 0.3 each; the output node uses threshold t = 0.4.]

Y = I( 0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 > 0 ),

where I(z) = 1 if z is true, 0 otherwise.
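To make the unit above concrete, here is a minimal Python sketch (not part of the lecture) that evaluates the thresholded weighted sum on the eight rows of the truth table; the function and variable names are illustrative.

# Minimal sketch (not from the lecture): a single threshold logic unit with the
# weights and threshold shown above, checked against the truth table.
def threshold_unit(x1, x2, x3, w=(0.3, 0.3, 0.3), t=0.4):
    """Return 1 if the weighted sum exceeds the threshold t, else 0."""
    s = w[0] * x1 + w[1] * x2 + w[2] * x3
    return 1 if s - t > 0 else 0

rows = [(1, 0, 0, 0), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 1),
        (0, 0, 1, 0), (0, 1, 0, 0), (0, 1, 1, 1), (0, 0, 0, 0)]
for x1, x2, x3, y in rows:
    assert threshold_unit(x1, x2, x3) == y   # reproduces the Y column above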
Artificial Neural Networks (ANN)
 The model is an assembly of inter-connected nodes and weighted links.
 The output node sums up each of its input values according to the weights of its links.
 The weighted sum is compared against some threshold t.

[Figure: input nodes X1, X2, X3 connect to the output node Y through links with weights w1, w2, w3; the output node applies threshold t.]

Perceptron Model

Y = I( Σ_i w_i X_i − t > 0 )
or
Y = sign( Σ_i w_i X_i − t )
Training a single neuron
Rosenblatt’s algorithm:

Linearly separable instances
Rosenblatt’s algorithm converges and finds a separating plane when the data set is linearly separable.
The simplest example of a concept that is not linearly separable: the exclusive-OR (parity) function.
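The lecture only states that Rosenblatt’s algorithm converges on linearly separable data; as a rough illustration of the idea, the following Python sketch applies the standard perceptron update rule whenever an example is misclassified. The learning rate, epoch limit, and data layout are assumptions made for the example.

# Sketch of Rosenblatt's perceptron learning rule: nudge the weights and bias
# toward every misclassified example; stop once an epoch has no errors.
def train_perceptron(data, lr=0.1, epochs=100):
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in data:                       # y is 0 or 1
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:
                errors += 1
                for i in range(n):
                    w[i] += lr * (y - pred) * x[i]
                b += lr * (y - pred)
        if errors == 0:                         # separating plane found
            break
    return w, b

# "At least two of three inputs are 1" is linearly separable, so this converges.
data = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1),
        ((0, 0, 1), 0), ((0, 1, 0), 0), ((0, 1, 1), 1), ((0, 0, 0), 0)]
w, b = train_perceptron(data)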
Classifying parity with more neurons
A neural network with a sufficient number of neurons can classify any data set correctly.
General Structure of ANN
[Figure: a feed-forward network with an input layer (x1, …, x5), a hidden layer (neurons I1, I2, I3), and an output layer producing y. Each neuron i forms the weighted sum S_i = Σ_j w_ij x_j of its inputs and emits O_i = g(S_i), where g is the activation function applied against a threshold t.]

Training an ANN means learning the weights of the neurons.
Algorithm for learning ANN
 Initialize the weights (w0, w1, …, wk).
 Adjust the weights in such a way that the output of the ANN is consistent with the class labels of the training examples.
– Objective function:  E = Σ_i [ Y_i − f(w_i, X_i) ]²
– Find the weights w_i that minimize the above objective function, e.g., with the backpropagation algorithm.
 Details: Nilsson’s ML (Chapter 4) PDF.
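As a rough sketch of what minimizing E looks like in code (assuming a single sigmoid unit and squared error; full backpropagation applies the same chain-rule step layer by layer), consider:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(w, b, data, lr=0.5):
    """One gradient-descent step on E = sum_i (Y_i - f(w, X_i))^2 for a
    single sigmoid unit f; backpropagation repeats this layer by layer."""
    gw, gb = [0.0] * len(w), 0.0
    for x, y in data:
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        delta = (o - y) * o * (1 - o)   # dE/d(net input); constant 2 folded into lr
        for i, xi in enumerate(x):
            gw[i] += delta * xi
        gb += delta
    return [wi - lr * gi for wi, gi in zip(w, gw)], b - lr * gb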
WEKA
WEKA implementation
WEKA has implementations of all the major data mining algorithms, including:
• decision trees (CART, C4.5, etc.)
• naïve Bayes algorithm and all variants
• nearest neighbor classifier
• linear classifier
• support vector machines
• clustering algorithms
• boosting algorithms, etc.
Weka tutorials
http://sentimentmining.net/weka/
Contains videos showing how to use weka for
various data mining applications.
A case study in classification
CES 514 course project from 2007 (Olson)
Consider a board game (e.g., checkers, backgammon). Given a position, we want to determine how strong the position of one player (say, black) is.
Can we train a classifier to learn this from a training set?
As usual, the problems are:
• choice of attributes
• creating labeled samples
Peg Solitaire – a one-player version of checkers
• To win, the player should remove all except one peg.
• A position from which a win can be achieved is called a solvable position.
Square board and a solvable position
Winning move sequence: (3, 4, 5), (5, 13, 21),
(25, 26, 27), (27, 28, 29), (21, 29, 37), (37,
45, 53), (83, 62, 61), (61, 53, 45)
How to choose attributes?
1. Number of pegs (pegs).
2. Number of first moves for any peg on the board
(first_moves).
3. Number of rows having 4 pegs separated by
single vacant positions (ideal_row).
4. Number of columns having 4 pegs separated by
single vacant positions (ideal_col).
5. Number of the first two moves for any peg on
the board (first_two).
6. Percentage of the total number of pegs in
quadrant one (quad_one).
7. Percentage of the total number of pegs in
quadrant two (quad_two).
List of attributes
• Percentage of the total number of pegs in quadrant three (quad_three).
• Percentage of the total number of pegs in quadrant four (quad_four).
• Number of pegs isolated by one vacant position (island_one).
• Number of pegs isolated by two vacant positions (island_two).
• Number of rows having 3 pegs separated by single vacant positions (ideal_row_three).
• Number of columns having 3 pegs separated by single vacant positions (ideal_col_three).
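The project’s actual board representation and feature-extraction code are not shown in the transcript; purely as an illustration, assuming the board is stored as a square 2D list with 1 = peg and 0 = vacant, the first and sixth attributes might be computed like this (hypothetical helper names):

# Illustrative only: assumes the board is a square list of lists with
# 1 = peg and 0 = vacant; these helpers are hypothetical, not the project's code.
def pegs(board):
    """Attribute 1: total number of pegs on the board."""
    return sum(sum(row) for row in board)

def quad_one(board):
    """Attribute 6: percentage of all pegs lying in quadrant one
    (taken here to be the top-left quadrant)."""
    n = len(board)
    total = pegs(board)
    q1 = sum(board[r][c] for r in range(n // 2) for c in range(n // 2))
    return 100.0 * q1 / total if total else 0.0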
Summary of performance
Text Classification
• Text classification has many applications
– Spam email detection
– Automated tagging of streams of news articles, e.g., Google
News
– Online advertising: what is this Web page about?
• Data Representation
– “Bag of words” most commonly used: either counts or binary
– Can also use “phrases” (e.g., bigrams) for commonly occurring
combinations of words
• Classification Methods
– Naïve Bayes widely used (e.g., for spam email)
• Fast and reasonably accurate
– Support vector machines (SVMs)
• Typically the most accurate method in research studies
• But more complex computationally
– Logistic Regression (regularized)
• Not as widely used, but can be competitive with SVMs (e.g., Zhang
and Oles, 2002)
Ch. 13
Types of Labels/Categories/Classes
• Assigning labels to documents or web-pages
– Labels are most often topics such as Yahoo! categories
– "finance", "sports", "news>world>asia>business"
• Labels may be genres
– "editorials", "movie-reviews", "news"
• Labels may be opinion on a person/product
– “like”, “hate”, “neutral”
• Labels may be domain-specific
– "interesting-to-me" : "not-interesting-to-me”
– “contains adult language” : “doesn’t”
– language identification: English, French,
Chinese, …
Common Data Sets used for Evaluation
• Reuters
– 10700 labeled documents
– 10% documents with multiple class labels
• Yahoo! Science Hierarchy
– 95 disjoint classes with 13,598 pages
• 20 Newsgroups data
– 18800 labeled USENET postings
– 20 leaf classes, 5 root level classes
• WebKB
– 8300 documents in 7 categories such as
“faculty”, “course”, “student”.
Practical Issues
• Tokenization
– Convert document to word counts = “bag of words”
– word token = “any nonempty sequence of characters”
– for HTML (etc) need to remove formatting
• Canonical forms, Stopwords, Stemming
– Remove capitalization
– Stopwords:
• remove very frequent words (a, the, and…) – can use
standard list
• Can also remove very rare words, e.g., words that
only occur in k or fewer documents, e.g., k = 5
• Data representation
– e.g., sparse 3-column format for bag of words: <docid, termid, count>
– can use inverted indices, etc.
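A minimal sketch of the pipeline just described (lowercasing, word tokenization, stopword removal, sparse <docid, termid, count> triples); the regular expression and the stopword list are illustrative assumptions:

# Sketch: documents -> sparse <docid, termid, count> triples ("bag of words").
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "and", "of", "to", "in"}   # tiny illustrative list

def tokenize(text):
    """Lowercase, keep alphanumeric word tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def bag_of_words(docs):
    vocab = {}                       # term -> termid
    triples = []                     # (docid, termid, count)
    for docid, text in enumerate(docs):
        for term, count in Counter(tokenize(text)).items():
            termid = vocab.setdefault(term, len(vocab))
            triples.append((docid, termid, count))
    return vocab, triples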
Challenges of text classification
 ML classification techniques were designed for structured data.
 Text: lots of features and lots of noise.
 No fixed number of columns.
 No categorical attribute values.
 Data scarcity.
 Large number of class labels.
 Hierarchical relationships between classes, less systematic than in structured data.
Techniques
 Nearest neighbor classifier
• Lazy learner: remembers all training instances.
• Decision on a test document: based on the distribution of labels on the training documents most similar to it.
• Assigns large weights to rare terms.
 Feature selection
• Removes terms in the training documents which are statistically uncorrelated with the class labels.
 Bayesian classifier
• Fit a generative term distribution Pr(d|c) to each class c of documents.
• Testing: the distribution most likely to have generated a test document is used to label it.
Sec.13.2.1
Stochastic Language Models
 Model the probability of generating strings (each word in turn) in a language (commonly all strings over alphabet Σ). E.g., a unigram model:

Model M:
  the     0.2
  a       0.1
  man     0.01
  woman   0.01
  said    0.03
  likes   0.02
  …

s = "the man likes the woman"
     0.2 × 0.01 × 0.02 × 0.2 × 0.01   (multiply)

P(s | M) = 0.00000008
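The computation above is easy to mirror in a few lines of Python; the dictionary holds the Model M probabilities from the slide, and unseen words are given probability 0 in this simple sketch:

# Probability of a string under the unigram Model M from the slide above.
model_m = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
           "said": 0.03, "likes": 0.02}

def string_prob(words, model):
    p = 1.0
    for w in words:
        p *= model.get(w, 0.0)       # unseen words get probability 0 in this sketch
    return p

print(string_prob("the man likes the woman".split(), model_m))   # ≈ 8e-08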
Sec.13.2.1
Stochastic Language Models
 Model the probability of generating any string.

Model M1:                  Model M2:
  the       0.2              the       0.2
  class     0.01             class     0.0001
  sayst     0.0001           sayst     0.03
  pleaseth  0.0001           pleaseth  0.02
  yon       0.0001           yon       0.1
  maiden    0.0005           maiden    0.01
  woman     0.01             woman     0.0001

s = "the class pleaseth yon maiden"
  P(s | M1) = 0.2 × 0.01   × 0.0001 × 0.0001 × 0.0005
  P(s | M2) = 0.2 × 0.0001 × 0.02   × 0.1    × 0.01

P(s | M2) > P(s | M1)
Sec.13.2
Using Multinomial Naive Bayes Classifiers
to Classify Text: Basic method
 Attributes are text positions, values are words.

c_NB = argmax_{c_j ∈ C} P(c_j) Π_i P(x_i | c_j)
     = argmax_{c_j ∈ C} P(c_j) · P(x_1 = "our" | c_j) · … · P(x_n = "text" | c_j)

 Too many possibilities.
 Assume that classification is independent of the positions of the words:
  use the same parameters for each position;
  the result is the bag-of-words model (over tokens).
Sec.13.2
Naive Bayes: Learning
 From the training corpus, extract the Vocabulary.
 Calculate the required P(c_j) and P(x_k | c_j) terms.
 For each c_j in C do:
  docs_j ← subset of documents for which the target class is c_j

    P(c_j) = |docs_j| / |total # of documents|

  Text_j ← a single document containing all of docs_j
  For each word x_k in Vocabulary:
    n_k ← number of occurrences of x_k in Text_j

    P(x_k | c_j) = (n_k + α) / (n + α·|Vocabulary|)

  (where n is the total number of word positions in Text_j and α is the smoothing constant; α = 1 gives Laplace smoothing)
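A compact Python sketch of this learning procedure (multinomial Naive Bayes with add-α smoothing, storing log probabilities); the input format, a list of (token_list, class_label) pairs, is an assumption for the example:

# Sketch: multinomial Naive Bayes training with add-alpha smoothing.
# Assumes docs is a list of (token_list, class_label) pairs.
import math
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    vocab = {w for tokens, _ in docs for w in tokens}
    by_class = defaultdict(list)
    for tokens, c in docs:
        by_class[c].append(tokens)
    priors, cond = {}, {}
    for c, doc_list in by_class.items():
        priors[c] = math.log(len(doc_list) / len(docs))            # log P(c_j)
        text_j = [w for tokens in doc_list for w in tokens]        # the "mega-document"
        counts, n = Counter(text_j), len(text_j)
        cond[c] = {w: math.log((counts[w] + alpha) / (n + alpha * len(vocab)))
                   for w in vocab}                                 # log P(x_k | c_j)
    return vocab, priors, cond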
Sec.13.2
Naive Bayes: Classifying
 positions  all word positions in current document
which contain tokens found in Vocabulary
 Return cNB, where
cNB  argmax P(c j )
c jC
 P( x | c )
i
i positions
j
Sec.13.2
Naive Bayes: Time Complexity
 Training time: O(|D| L_ave + |C||V|),
where L_ave is the average length of a document in D.
 Assumes all counts are pre-computed in O(|D| L_ave) time during one pass through all of the data.
 Generally just O(|D| L_ave), since usually |C||V| < |D| L_ave.
 Test time: O(|C| L_t),
where L_t is the average length of a test document.
 Very efficient overall: linearly proportional to the time needed to just read in all the data.
Sec.13.2
Underflow Prevention: using logs
 Multiplying lots of probabilities, which are
between 0 and 1 by definition, can result in
floating-point underflow.
 Since log(xy) = log(x) + log(y), it is better
to perform all computations by summing logs
of probabilities rather than multiplying
probabilities.
 Class with highest final un-normalized log
probability score is still the most probable.
c NB  argmax [log P(c j ) 
c j C
log P(x
i
| c j )]
ipositions
 Note that model is now just max of sum of
weights…
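A matching sketch of the classification step, summing logs exactly as in the formula above; it reuses the (vocab, priors, cond) structures from the training sketch and skips tokens outside the vocabulary:

# Sketch: classify by summing log probabilities (no underflow); reuses the
# (vocab, priors, cond) structures from the training sketch earlier.
def classify_nb(tokens, vocab, priors, cond):
    best_class, best_score = None, float("-inf")
    for c in priors:
        score = priors[c] + sum(cond[c][w] for w in tokens if w in vocab)
        if score > best_score:
            best_class, best_score = c, score
    return best_class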
Naive Bayes Classifier
c NB  argmax [log P(c j ) 
c j C
log P(x
i
| c j )]
ipositions
 Simple interpretation: Each conditional
parameter log P(xi|cj) is a weight that
indicates how good an indicator xi is for cj.
 The prior log P(cj) is a weight that indicates
the relative frequency of cj.
 The sum is then a measure of how much
evidence there is for the document being in
the class.
 We select the class with the most evidence
for it
Two Naive Bayes Models
 Model 1: Multivariate Bernoulli
 One feature Xw for each word in dictionary
 Xw = true in document d if w appears in d
 Naive Bayes assumption:
 Given the document’s topic, appearance of one word
in the document tells us nothing about chances that
another word appears
 This is the model used in the binary independence model in classic probabilistic relevance feedback on hand-classified data.
Two Models
 Model 2: Multinomial = Class conditional
unigram
 One feature X_i for each word position in the document
 feature’s values are all words in dictionary
 Value of Xi is the word in position i
 Naïve Bayes assumption:
 Given the document’s topic, word in one position in
the document tells us nothing about words in other
positions
 Second assumption:
 Word appearance does not depend on position
P(X_i = w | c) = P(X_j = w | c)
for all positions i,j, word w, and class c
 Just have one multinomial feature predicting all
words
Parameter estimation
 Multivariate Bernoulli model:
  P̂(X_w = t | c_j) = fraction of documents of topic c_j in which word w appears
 Multinomial model:
  P̂(X_i = w | c_j) = fraction of times in which word w appears among all words in documents of topic c_j
   Can create a mega-document for topic j by concatenating all documents on this topic.
   Use the frequency of w in the mega-document.
Classification
 Multinomial vs Multivariate Bernoulli?
 Multinomial model is almost always more
effective in text applications
Sec.13.5
Feature Selection: Why?
 Text collections have a large number of
features
 10,000 – 1,000,000 unique words … and more
 May make using a particular classifier
feasible
 Some classifiers can’t deal with hundreds of thousands of features.
 Reduces training time
 Training time for some methods is quadratic
or worse in the number of features
 Can improve generalization (performance)
 Eliminates noise features
 Avoids overfitting
Sec.13.5
Feature selection: how?
 Two ideas:
 Hypothesis testing statistics:
 Are we confident that the value of one categorical
variable is associated with the value of another?
 Chi-square test (2)
 Information theory:
 How much information does the value of one
categorical variable give you about the value of
another?
 Mutual information
 They’re similar, but 2 measures confidence in
association, (based on available statistics), while MI
measures extent of association (assuming perfect
knowledge of probabilities)
χ² statistic (CHI)
Sec.13.5.2
 χ² is interested in (f_o − f_e)²/f_e summed over all table entries: is the observed number what you’d expect given the marginals?
χ²(j, a) = Σ (O − E)²/E
         = (2 − 0.25)²/0.25 + (3 − 4.75)²/4.75 + (500 − 502)²/502 + (9500 − 9498)²/9498
         ≈ 12.9   (p < 0.001)
 The null hypothesis is rejected with confidence .999,
 since 12.9 > 10.83 (the value for .999 confidence).
Term = jaguar Term  jaguar
Class = auto
Class  auto
2 (0.25)
3 (4.75)
500
expected: fe
(502)
9500 (9498)
observed: fo
Sec.13.5.2
χ² statistic
There is a simpler formula for the 2×2 χ². With

A = #(t, c)    B = #(t, ¬c)
C = #(¬t, c)   D = #(¬t, ¬c)
N = A + B + C + D

it is

χ²(t, c) = N (AD − CB)² / [ (A + C)(B + D)(A + B)(C + D) ]
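A small Python sketch of the 2×2 shortcut formula; the A, B, C, D arguments follow the definitions above, and the example call uses the jaguar/auto counts (it gives roughly 12.8, matching the ≈12.9 obtained earlier up to the rounding of the expected counts):

# Chi-square score for a (term, class) pair from the 2x2 counts, using the
# shortcut formula above: A = #(t,c), B = #(t,~c), C = #(~t,c), D = #(~t,~c).
def chi_square(A, B, C, D):
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

print(chi_square(2, 3, 500, 9500))   # jaguar/auto counts: ~12.8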
Sec.13.5.1
Feature selection via Mutual
Information
 In training set, choose k words which
best discriminate (give most info on)
the categories.
 The mutual information between a word and a class is:

I(w, c) = Σ_{e_w ∈ {0,1}} Σ_{e_c ∈ {0,1}} p(e_w, e_c) log [ p(e_w, e_c) / ( p(e_w) p(e_c) ) ]

 Computed for each word w and each category c.
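A direct translation of the MI formula into Python, from the four joint counts of (word present/absent, in class/not in class); natural log is used here, since the choice of base only rescales the scores:

# Mutual information I(w, c) from the four joint counts:
# N11 = #(w present, in c), N10 = #(w present, not in c),
# N01 = #(w absent, in c),  N00 = #(w absent, not in c).
import math

def mutual_information(N11, N10, N01, N00):
    N = N11 + N10 + N01 + N00
    mi = 0.0
    for n_wc, n_w, n_c in [(N11, N11 + N10, N11 + N01),
                           (N10, N11 + N10, N10 + N00),
                           (N01, N01 + N00, N11 + N01),
                           (N00, N01 + N00, N10 + N00)]:
        if n_wc:                                 # skip zero cells (0 * log 0 = 0)
            p_wc = n_wc / N
            mi += p_wc * math.log(p_wc / ((n_w / N) * (n_c / N)))
    return mi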
Sec.13.5.1
Feature selection via MI
 For each category we build a list of k
most discriminating terms.
 For example (on 20 Newsgroups):
 sci.electronics: circuit, voltage, amp,
ground, copy, battery, electronics, cooling, …
 rec.autos: car, cars, engine, ford, dealer,
mustang, oil, collision, autos, tires, toyota,
…
 Greedy: does not account for correlations
between terms
 Why?
Sec.13.5
Feature Selection
 Mutual Information
 Clear information-theoretic interpretation
 May select rare uninformative terms
 Chi-square
 Statistical foundation
 May select very slightly informative
frequent terms that are not very useful for
classification
 Just use the commonest terms?
 No particular foundation
 In practice, this is often 90% as good
Greedy inclusion algorithm
 Most commonly used in text.
 Algorithm:
• Compute, for each term, a measure of discrimination amongst classes.
• Arrange the terms in decreasing order of this measure.
• Retain a number of the best terms or features for use by the classifier.
 Greedy because the measure of discrimination of a term is computed independently of other terms.
 Over-inclusion: mild effects on accuracy.
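The greedy step itself is tiny once a per-term score (e.g., χ² or MI) has been computed; a sketch:

# Greedy inclusion: rank terms by an independently computed score and keep the top k.
def select_features(term_scores, k):
    """term_scores: dict mapping term -> discrimination score (e.g., chi-square or MI)."""
    return sorted(term_scores, key=term_scores.get, reverse=True)[:k]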
Feature selection - performance
• A Bayesian classifier cannot overfit much.
Effect of feature selection on Bayesian
classifiers
Sec.13.6
Naive Bayes vs. other methods
Benchmarks for accuracy
 Reuters
• 10700 labeled documents
• 10% documents with multiple class labels
 OHSUMED
• 348566 abstracts from medical journals
 20NG
• 18800 labeled USENET postings
• 20 leaf classes, 5 root level classes
 WebKB
• 8300 documents in 7 academic categories.