Transcript: NB, KNN, Feature Selection

Bayesian Classification, KNN, Feature Selection, Evaluation
陈翀 (Chen Chong)
Ref: PengBo@NC&IS@PKU, WBIA lecture
Lecture notes, Seminar on Statistical Machine Learning and Data Mining Techniques and Methods
Outline of This Lecture

Text Categorization (文本分类)
- Problem definition
- Naïve Bayes Classifier
- K-Nearest Neighbor Classifier
- Evaluation
Categorization: Definition

Given:
- A description of an instance, x ∈ X, where X is the instance language or instance space.
  - Issue: how to represent text documents.
- A fixed set of categories: C = {c1, c2, …, cn}

Determine:
- The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

We want to know how to build categorization functions ("classifiers").
Text Categorization Examples

Assign labels to each document or web-page:
- Labels are most often topics, such as Yahoo-categories
  - e.g., "finance," "sports," "news>world>asia>business"
- Labels may be genres
  - e.g., "editorials," "movie-reviews," "news"
- Labels may be opinion
  - e.g., "like," "hate," "neutral"
- Labels may be domain-specific binary
Classification Methods

Manual classification
- Used by Yahoo!, Looksmart, about.com, ODP, Medline
- Accurate but expensive to scale

Automatic document classification
- Hand-coded rule-based systems
  - e.g., spam mail filters, …
- Supervised learning of a document-label assignment function
  - No free lunch: requires hand-classified training data
- Note that many commercial systems use a mixture of methods
Naïve Bayes
Bernoulli Trial

A Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes, "success" and "failure".
Binomial Distribution

The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.
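The formula itself did not survive the transcript; the standard binomial pmf the slide presumably shows is

  P(X = k) = C(n, k) · p^k · (1 − p)^(n − k),   k = 0, 1, …, n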
Multinomial Distribution

The multinomial distribution is a generalization of the binomial distribution:
- each trial results in one of some fixed finite number k of possible outcomes, with probabilities p1, …, pk;
- there are n independent trials;
- we can use a random variable Xi to indicate the number of times outcome i occurs over the n trials.
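Likewise, the multinomial pmf is not in the transcript; the standard form is presumably

  P(X1 = n1, …, Xk = nk) = n! / (n1! · … · nk!) · p1^n1 · … · pk^nk,   with n1 + … + nk = n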
Bayes' Rule

  P(h|D) = P(D|h) · P(h) / P(D)

where P(h) is the prior, P(D|h) is the likelihood, and P(h|D) is the posterior.
Use Bayes' Rule to Gamble

Someone draws an envelope at random and offers to sell it to you. How much should you pay?
- Before deciding, you are allowed to see one bead drawn from the envelope. Suppose it is red: how much should you pay now?
Prosecutor's Fallacy

You win the lottery jackpot. You are then charged with having cheated, for instance with having bribed lottery officials.
- At the trial, the prosecutor points out that winning the lottery without cheating is extremely unlikely, and that therefore your being innocent must be extremely unlikely too.
- The fallacy: this confuses P(evidence | innocence) with P(innocence | evidence), ignoring the prior probability of cheating.
Likelihood

A likelihood function is a conditional probability function considered as a function of its second argument, with its first argument held fixed.
- Given a parameterized family of probability density functions x ↦ f(x | θ), where θ is the parameter, the likelihood function is

  L(θ | x) = f(x | θ)

  where x is the observed outcome of an experiment.
Maximum A Posteriori Hypothesis

  h_MAP = argmax_{h ∈ H} P(h | D)
        = argmax_{h ∈ H} P(D | h) P(h) / P(D)
        = argmax_{h ∈ H} P(D | h) P(h)        (since P(D) is constant)
Maximum Likelihood Hypothesis

If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:

  h_ML = argmax_{h ∈ H} P(D | h)
Bayes Classifiers

Task: classify a new instance D, described by a tuple of attribute values D = <x1, x2, …, xn>, into one of the classes cj ∈ C.

  c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xn)
        = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj) / P(x1, x2, …, xn)
        = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj)
Naïve Bayes Assumption

P(cj)
- Can be estimated from the frequency of classes in the training examples.

P(x1, x2, …, xn | cj)
- O(|X|^n · |C|) parameters
- Could only be estimated if a very, very large number of training examples was available.

Naïve Bayes conditional independence assumption:
- Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj).
The Naïve Bayes Classifier

Example: class Flu, with binary features X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle-ache.

Conditional independence assumption: features detect term presence and are independent of each other given the class:

  P(X1, …, X5 | C) = P(X1 | C) · P(X2 | C) · … · P(X5 | C)

This model is appropriate for binary variables: the multivariate binomial (Bernoulli) model.
Learning the Model

(Graphical model: class C with features X1, …, X6.)

First attempt: maximum likelihood estimates, i.e. simply use the frequencies in the data:

  P̂(cj) = N(C = cj) / N

  P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)
Problem with Max Likelihood

(Same flu model: P(X1, …, X5 | C) = P(X1 | C) · P(X2 | C) · … · P(X5 | C).)

What if we have seen no training cases where the patient had muscle aches but did not have the flu?

  P̂(X5 = t | C = nf) = N(X5 = t, C = nf) / N(C = nf) = 0

Zero probabilities cannot be conditioned away, no matter the other evidence!

  c = argmax_c P̂(c) ∏_i P̂(xi | c)
Smoothing to Eliminate Zeros

  P̂(xi | cj) = [N(Xi = xi, C = cj) + 1] / [N(C = cj) + k]

where k is the number of values of Xi.

Add-one (Laplace) smoothing:
- acts as a uniform prior (each attribute value occurs once for each class) that is then updated as evidence from the training data comes in.
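A minimal sketch of the two estimators on the flu example above; the training counts are invented for illustration:

```python
def p_hat(data, x5, c, smooth=False, k=2):
    """Estimate P(X5 = x5 | C = c) by counting; k = number of values of X5."""
    n_c = sum(1 for _, cls in data if cls == c)
    n_xc = sum(1 for x, cls in data if x == x5 and cls == c)
    if smooth:                       # Laplace (add-one) smoothing
        return (n_xc + 1) / (n_c + k)
    return n_xc / n_c                # maximum likelihood estimate

# Hypothetical training data: (muscle_ache, class) pairs
data = [(True, "flu"), (True, "flu"), (False, "flu"),
        (False, "no-flu"), (False, "no-flu"), (False, "no-flu")]

print(p_hat(data, True, "no-flu"))               # 0.0: the zero wipes out all other evidence
print(p_hat(data, True, "no-flu", smooth=True))  # (0 + 1) / (3 + 2) = 0.2
```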
Document Generative Model

"Love is patient, love is kind."
- Basic representation: bag of words {love, is, patient, kind}.

Two generative models:
- A binary independence model (multivariate binomial generation)
  - feature Xi is a term
  - value Xi = 1 or 0 indicates whether term Xi is present in the doc or not
- A multinomial unigram language model (multinomial generation)
  - feature Xi is a term position
  - value of Xi = the term at position i
Bernoulli Naive Bayes Classifiers

Multivariate binomial model:
- One feature Xw for each word in the dictionary
- Xw = true in document d if w appears in d
- Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears.
Multinomial Naive Bayes Classifiers II

Multinomial = class-conditional unigram:
- One feature Xi for each word position in the document
  - the feature's values are all the words in the dictionary
  - the value of Xi is the word in position i
- Naïve Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about the words in other positions.

  c_NB = argmax_{cj ∈ C} P(cj) ∏_i P(xi | cj)
       = argmax_{cj ∈ C} P(cj) · P(x1 = "our" | cj) · … · P(xn = "text" | cj)
Multinomial Naive Bayes Classifiers

Still too many possibilities.

Second assumption: classification is independent of the positions of the words.
- Word appearance does not depend on position:

  P(Xi = w | c) = P(Xj = w | c)

  for all positions i, j, every word w, and every class c.
- Just have one multinomial feature predicting for all words.
- Use the same parameters for each position.
Parameter Estimation

Binomial (Bernoulli) model:
  P̂(Xw = t | cj) = fraction of documents of topic cj in which word w appears

Multinomial model:
  P̂(Xi = w | cj) = fraction of times word w appears across all documents of topic cj
  - Can create a mega-document for topic j by concatenating all documents of this topic.
Naive Bayes Algorithm (Multinomial Model)

Naive Bayes Algorithm (Bernoulli Model)
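The pseudocode on these two algorithm slides is not in the transcript. Below is a minimal sketch of the multinomial model only (training with add-one smoothing, classification in log space), written to match the assumptions above; the toy documents at the end are hypothetical:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (tokens, class_label) pairs. Returns priors, smoothed conditionals, vocab."""
    n_docs = len(docs)
    class_counts = Counter(c for _, c in docs)
    priors = {c: class_counts[c] / n_docs for c in class_counts}

    vocab = {w for tokens, _ in docs for w in tokens}
    term_counts = defaultdict(Counter)           # term_counts[c][w] = occurrences of w in class c
    for tokens, c in docs:
        term_counts[c].update(tokens)

    cond_prob = {}
    for c in class_counts:
        total = sum(term_counts[c].values())
        # add-one (Laplace) smoothing over the vocabulary
        cond_prob[c] = {w: (term_counts[c][w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, cond_prob, vocab

def classify_multinomial_nb(tokens, priors, cond_prob, vocab):
    """Return argmax_c [log P(c) + sum_i log P(x_i | c)]; words outside the vocab are ignored."""
    scores = {c: math.log(priors[c]) + sum(math.log(cond_prob[c][w]) for w in tokens if w in vocab)
              for c in priors}
    return max(scores, key=scores.get)

# Hypothetical toy usage (not the slide's example):
docs = [(["chinese", "beijing", "chinese"], "China"),
        (["tokyo", "japan", "chinese"], "not-China")]
model = train_multinomial_nb(docs)
print(classify_multinomial_nb(["chinese", "chinese", "tokyo"], *model))
```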
NB Example

Classify test document 5: c(5) = ?

Multinomial NB Classifier
- Feature likelihood estimates
- Posterior
- Result: c(5) = China

Bernoulli NB Classifier
- Feature likelihood estimates
- Posterior
Classification

Multinomial vs. multivariate binomial?
- Multinomial is in general better.
Feature Selection
Feature Selection: Why?

Text collections have a large number of features
- 10,000 – 1,000,000 unique words … and more

May make using a particular classifier feasible
- Some classifiers can't deal with 100,000s of features

Reduces training time
- Training time for some methods is quadratic or worse in the number of features

Can improve generalization
Feature Selection: How?

Two ideas:
- Hypothesis-testing statistics:
  - Are we confident that the value of one categorical variable is associated with the value of another?
  - Chi-square test
- Information theory:
  - How much information does the value of one categorical variable give you about the value of another?
  - Mutual information

They're similar, but χ² measures confidence in the association (based on available statistics), while MI measures the extent of the association (assuming perfect knowledge of probabilities).
Pearson's Chi-Square Test

Example: is a six-sided die "fair"?
- Null hypothesis: the die is fair, so there is a theoretical (expected) frequency for each possible outcome.
- The chi-square statistic is

  χ² = Σ_{i=1}^{n} (Oi − Ei)² / Ei

  where Oi is the observed frequency and Ei is the theoretical frequency of outcome i.
χ² Statistic (CHI)

Contingency table for term = "jaguar" and class = "auto": observed counts fo, with expected counts fe in parentheses.

                    Class = auto     Class ≠ auto
  Term = jaguar        2 (0.25)         3 (4.75)
  Term ≠ jaguar      500 (502)       9500 (9498)

  χ²(jaguar, auto) = Σ (O − E)² / E
                   = (2 − 0.25)²/0.25 + (3 − 4.75)²/4.75 + (500 − 502)²/502 + (9500 − 9498)²/9498
                   ≈ 12.9   (p < 0.001)

Null hypothesis: the events "term = jaguar" and "class = auto" are independent.
- The null hypothesis is rejected with confidence 0.999, since 12.9 > 10.83 (the critical value at the 0.001 significance level for one degree of freedom).
Example

The class "poultry" and the term "export": count the number of documents for each of the four possible combinations of (term present / absent) × (in class / not in class).
χ² Statistic (CHI)

There is a simpler formula for the 2×2 χ². With

  A = #(t, c)     B = #(t, ¬c)
  C = #(¬t, c)    D = #(¬t, ¬c)     N = A + B + C + D

the statistic is

  χ² = N (AD − CB)² / [ (A + C)(B + D)(A + B)(C + D) ]
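As a quick check of the jaguar/auto numbers above, a sketch computing χ² both from the summation form and from the 2×2 shortcut (not the lecture's code):

```python
def chi2_from_table(a, b, c, d):
    """2x2 shortcut: A = #(t,c), B = #(t,~c), C = #(~t,c), D = #(~t,~c)."""
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

def chi2_from_expected(observed, expected):
    """Summation form: sum of (O - E)^2 / E over the table cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# jaguar / auto counts from the slide
print(chi2_from_table(2, 3, 500, 9500))                                # ≈ 12.8
print(chi2_from_expected([2, 3, 500, 9500], [0.25, 4.75, 502, 9498]))  # ≈ 12.9
```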
Feature Selection via Mutual Information

In the training set, choose the k words which best discriminate (give most information on) the categories.
- MI measures how much information the presence/absence of a term contributes to making the correct classification decision on c.
- The mutual information between a term t and a class c (with et, ec ∈ {0, 1} indicating term presence and class membership) is

  I(t, c) = Σ_{et} Σ_{ec} P(et, ec) · log [ P(et, ec) / (P(et) P(ec)) ]
Example
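The worked example from this slide is not reproduced in the transcript; as a stand-in, a sketch of computing the MI score from a 2×2 term/class contingency table with maximum-likelihood probability estimates (the counts below are hypothetical):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI of a term and a class from a 2x2 contingency table.
    n11 = docs with the term and in the class, n10 = with the term but not in the class,
    n01 = without the term but in the class, n00 = neither."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # For each cell: joint count, term-marginal count, class-marginal count.
    for n_tc, n_t, n_c in [(n11, n11 + n10, n11 + n01),
                           (n10, n11 + n10, n10 + n00),
                           (n01, n01 + n00, n11 + n01),
                           (n00, n01 + n00, n10 + n00)]:
        if n_tc > 0:
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

# Hypothetical counts for one term/class pair
print(mutual_information(20, 30, 10, 940))
```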
Feature Selection via MI (contd.)

For each category we build a list of the k most discriminating terms.
- For example (on 20 Newsgroups):
  - sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …
  - rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …
- Greedy: does not account for correlations between terms.
  - Why?
Feature Selection

Mutual information
- Clear information-theoretic interpretation

Chi-square
- Statistical foundation
- May select more rare terms than MI

Just use the commonest terms?
- No particular foundation
- In practice, this is often 90% as good
Feature Selection for NB

In general, feature selection is necessary for the binomial (Bernoulli) NB model.
- Otherwise you suffer from noise and multi-counting.

"Feature selection" really means something different for multinomial NB: it means dictionary truncation.
- The multinomial NB model only has one feature.
K-Nearest Neighbors
Recall: Vector Space Representation

Each document is a vector, one component for each term (= word).
- Normalize to unit length.
- High-dimensional vector space:
  - terms are axes
  - 10,000+ dimensions, or even 100,000+
  - docs are vectors in this space
Classes in a Vector Space

(Figure: documents plotted in the vector space, grouped into the classes Government, Science, and Arts.)
Classification Using Vector Spaces

Each training doc is a point (vector) labeled by its topic (= class).
- Hypothesis: docs of the same class form a contiguous region of space.
- We define surfaces to delineate classes in the space.
Test Document = Government

Is the similarity hypothesis true in general?

(Figure: a test document falling within the Government region, among the Government, Science, and Arts classes.)
k Nearest Neighbor Classification

To classify document d into class c:
- Define the k-neighborhood N as the k nearest neighbors of d.
- Count the number i of documents in N that belong to c.
- Estimate P(c|d) as i/k.
- Choose as class argmax_c P(c|d)  [= the majority class]
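A minimal sketch of this procedure over unit-length sparse term vectors with cosine similarity; the vector representation and the toy training set are assumptions, not the lecture's data:

```python
import math
from collections import Counter

def normalize(vec):
    """Scale a sparse {term: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

def cosine(u, v):
    """Dot product of two unit-length sparse vectors = cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def knn_classify(doc_vec, training, k=3):
    """training: list of (unit_vector, class_label). Majority class among the k nearest."""
    neighbors = sorted(training, key=lambda ex: cosine(doc_vec, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]      # argmax_c (votes for c) / k

# Hypothetical usage with tiny term-count vectors:
training = [(normalize({"tax": 2, "bill": 1}), "Government"),
            (normalize({"genome": 3, "cell": 1}), "Science"),
            (normalize({"vote": 1, "tax": 1}), "Government")]
print(knn_classify(normalize({"tax": 1, "vote": 2}), training, k=3))  # "Government"
```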
Example: k = 6 (6NN)

P(science | test doc) = ?

(Figure: the test document and its 6 nearest neighbors among the Government, Science, and Arts classes.)
Nearest-Neighbor Learning Algorithm

Learning is just storing the representations of the training examples in D.
- Testing instance x:
  - Compute the similarity between x and all examples in D.
  - Assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes.
- Also called:
  - case-based learning
  - memory-based learning
Why k?

Using only the closest example to determine the categorization is subject to errors due to:
- a single atypical example
- noise (i.e. error) in the category label of a single training example

A more robust alternative is to find the k most-similar examples and return the majority category of these k examples.
- The value of k is typically odd to avoid ties; 3 and 5 are most common.
kNN Decision Boundaries

Boundaries are in principle arbitrary surfaces, but usually polyhedra.

(Figure: piecewise-linear boundaries between the Government, Science, and Arts regions.)
Similarity Metrics

Nearest-neighbor methods depend on a similarity (or distance) metric.
- Simplest for a continuous m-dimensional instance space: Euclidean distance.
- Simplest for an m-dimensional binary instance space: Hamming distance (number of feature values that differ).
- For text, cosine similarity of tf.idf-weighted vectors is typically most effective.
Illustration of 3 Nearest Neighbor for Text Vector Space
Nearest Neighbor with Inverted Index

Naively, finding nearest neighbors requires a linear search through the |D| documents in the collection.
- But determining the k nearest neighbors is the same as determining the k best retrievals using the test document as a query against a database of training documents.
- Use standard vector-space inverted-index methods to find the k nearest neighbors.
- Testing time: O(B|Vt|), where Vt is the test document's vocabulary and B is the average number of training documents in which a test-document word appears.
kNN: Discussion

- No feature selection necessary
- Scales well with a large number of classes
  - Don't need to train n classifiers for n classes
- Classes can influence each other
  - Small changes to one class can have a ripple effect
- Scores can be hard to convert to probabilities
- No training necessary
- Bias vs. variance: choosing the correct model capacity
kNN vs. Naive Bayes

Bias/variance tradeoff
- Variance ≈ capacity
- kNN has high variance and low bias.
  - Infinite memory
- NB has low variance and high bias.
  - The decision surface has to be linear (a hyperplane).
Summary

Categorization
- Training data
- Over-fitting & generalization

Naïve Bayes methods
- Bayesian classification: c_NB = argmax_{cj ∈ C} P(cj) ∏_i P(xi | cj)
- Bernoulli NB classifier
- Multinomial NB classifier: argmax_{cj ∈ C} P(cj) · P(x1 = "our" | cj) · … · P(xn = "text" | cj)

K-Nearest Neighbor
- Bias vs. variance

Feature selection
- Chi-square test: χ² = Σ_{i=1}^{n} (Oi − Ei)² / Ei
- Mutual information
Readings

[1] IIR, Ch. 13 and Ch. 14.2.
[2] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), 1999.

Classification Evaluation
Evaluation: Classic Reuters Data Set

Most (over)used data set:
- 21,578 documents
- 9,603 training, 3,299 test articles (ModApte split)
- 118 categories
  - an article can be in more than one category
  - learn 118 binary category distinctions
- Average document: about 90 types, 200 tokens
- Average number of classes assigned per document

Common categories (#train, #test):
- Earn (2877, 1087)          - Trade (369, 119)
- Acquisitions (1650, 179)   - Interest (347, 131)
- Money-fx (538, 179)        - Ship (197, 89)
- Grain (433, 149)           - Wheat (212, 71)
- Crude (389, 189)           - Corn (182, 56)
Reuters Text Categorization Data Set (Reuters-21578) Document

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector.
The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter
&#3;</BODY></TEXT></REUTERS>
New Reuters: RCV1: 810,000 docs

(Figure: top topics in Reuters RCV1.)
北大天网 (PKU Tianwang): Chinese Web Page Classification

By organizing dozens of students from different majors, a large-scale Chinese web page sample collection based on a hierarchical category model was built through manual selection.
- It contains 12,336 training web page instances and 3,269 test web page instances, distributed over 12 top-level categories and 733 categories in total; each category has on average 17 training instances and 4.5 test instances.
- Tianwang provides the web page sample collection free of charge to interested researchers; 燕穹 (YanQiong) product number: YQ-WEBENCH-V0.8.
- Chinese Web Information Retrieval Forum: www.cwirf.org
- Classification evaluations are conducted at SEWM, the national conference on search engines and web mining.
北大天网: Chinese Web Page Classification

  ID  Category                  # classes  # training  # test
   1  Humanities & Arts                24         419     110
   2  News & Media                      7         125      19
   3  Business & Economy               48         839     214
   4  Entertainment & Leisure          88        1510     374
   5  Computers & Internet             58         925     238
   6  Education                        18         286      85
   7  Countries & Regions              53         891     235
   8  Natural Science                 113        1892     514
   9  Government & Politics            18         288      84
  10  Social Science                  104        1765     479
  11  Health & Medicine               136        2295     616
  12  Society & Culture                66        1101     301
      Total                           733       12336    3269
Measuring Classification Figures of Merit

Accuracy of classification
- Main evaluation criterion in academia
- More in a moment

Speed of training the statistical classifier
- Some methods are very cheap; some very costly

Speed of classification (docs/hour)
- No big differences for most algorithms
- Exceptions: kNN, complex preprocessing requirements

Effort in creating the training set / hand-built classifier
Measuring Classification Figures of Merit

In the real world, economic measures. Your choices are:
- Do no classification
  - That has a cost (hard to compute)
- Do it all with an automatic classifier
  - Mistakes have a cost
- Do it all manually
  - Has an easy-to-compute cost if doing it like that now
- Do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases
  - Commonly the last method is most cost-efficient and is adopted
Concept Drift

Categories change over time.
- Example: "president of the united states"
  - 1999: "clinton" is a great feature
  - 2002: "clinton" is a bad feature
- One measure of a text classification system is how well it protects against concept drift.
- Feature selection can be bad at protecting against concept drift.
Per-Class Evaluation Measures

(The slide shows a 3×3 contingency table of predicted class A, B, C vs. actual class A, B, C.) Let cij be the number of documents whose actual class is i and whose predicted class is j.

- Recall: fraction of docs in class i classified correctly:  cii / Σ_j cij
- Precision: fraction of docs assigned class i that are actually about class i:  cii / Σ_j cji
- "Correct rate" (= 1 − error rate): fraction of docs classified correctly:  Σ_i cii / Σ_i Σ_j cij
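A minimal sketch of these per-class measures (plus the F1 used on the next slide) from a confusion matrix indexed as counts[actual][predicted]; the matrix below is made up:

```python
def per_class_measures(counts, cls):
    """counts[actual][predicted] -> (recall, precision, F1) for one class."""
    classes = counts.keys()
    tp = counts[cls][cls]
    recall = tp / sum(counts[cls][p] for p in classes)      # c_ii / sum_j c_ij
    precision = tp / sum(counts[a][cls] for a in classes)   # c_ii / sum_j c_ji
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1

def accuracy(counts):
    """Fraction of docs classified correctly: sum_i c_ii / sum_i sum_j c_ij."""
    classes = counts.keys()
    correct = sum(counts[c][c] for c in classes)
    total = sum(counts[a][p] for a in classes for p in classes)
    return correct / total

# Made-up 3-class confusion matrix (rows = actual, columns = predicted)
counts = {"A": {"A": 8, "B": 1, "C": 1},
          "B": {"A": 2, "B": 7, "C": 1},
          "C": {"A": 0, "B": 2, "C": 8}}
print(per_class_measures(counts, "A"), accuracy(counts))
```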
Measures of Accuracy

Evaluation must be done on test data that is independent of the training data.
- Overall error rate
  - Not a good measure for small classes. Why?
- Precision/recall for classification decisions
  - F1 measure: 1/F1 = ½ (1/P + 1/R)
- Correct estimate of the size of a category
  - Why is this different?
- Stability over time / category drift
- Utility
  - Costs of false positives / false negatives may differ.
Good Practice Department: N-Fold Cross-Validation

Results can vary based on sampling error due to different training and test sets.
- Average results over multiple training and test sets (splits of the overall data) for the best results.
- Ideally, test and training sets are independent on each trial.
  - But this would require too much labeled data.
- Partition the data into N equal-sized disjoint segments.
- Run N trials, each time using a different segment of the data for testing and training on the remaining N − 1 segments.
- This way, at least the test sets are independent.
- Report the average classification accuracy over the N trials.
- Typically, N = 10.
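A minimal sketch of the N-fold procedure; train_fn and accuracy_fn are placeholders for whatever classifier is being evaluated (e.g. the NB sketch earlier):

```python
import random

def n_fold_cross_validation(examples, train_fn, accuracy_fn, n=10, seed=0):
    """Split examples into n disjoint folds; train on n-1 folds, test on the held-out fold."""
    examples = examples[:]
    random.Random(seed).shuffle(examples)
    folds = [examples[i::n] for i in range(n)]         # n roughly equal-sized segments
    scores = []
    for i in range(n):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train)                        # placeholder training function
        scores.append(accuracy_fn(model, test))        # placeholder: fraction classified correctly
    return sum(scores) / n                             # average over the n trials
```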
Good Practice Department: Learning Curves

In practice, labeled data is usually rare and expensive.
- We would like to know how performance varies with the number of training instances.
- Learning curves plot classification accuracy on independent test data (Y axis) versus the number of training examples (X axis).
- One can do both of the above and produce learning curves averaged over multiple trials from cross-validation.
How to Combine Multiple Measures?

If we have more than one class, how do we combine multiple performance measures into one quantity?
- Macroaveraging: compute performance for each class, then average.
- Microaveraging: collect decisions for all classes, compute one contingency table, evaluate.
Micro- vs. Macro-Averaging: Example

Class 1:
                     Truth: yes   Truth: no
  Classifier: yes        10           10
  Classifier: no         10          970

Class 2:
                     Truth: yes   Truth: no
  Classifier: yes        90           10
  Classifier: no         10          890

Micro-average table (counts pooled over both classes):
                     Truth: yes   Truth: no
  Classifier: yes       100           20
  Classifier: no         20         1860

Macroaveraged precision: (0.5 + 0.9) / 2 = 0.7
Microaveraged precision: 100 / 120 ≈ 0.83
Why this difference?
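A minimal sketch reproducing both averages from per-class (tp, fp) counts, using the numbers in the tables above:

```python
def macro_precision(per_class):
    """per_class: list of (tp, fp) pairs. Average of the per-class precisions."""
    return sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)

def micro_precision(per_class):
    """Pool the counts into one contingency table, then compute a single precision."""
    tp = sum(t for t, _ in per_class)
    fp = sum(f for _, f in per_class)
    return tp / (tp + fp)

per_class = [(10, 10), (90, 10)]      # (tp, fp) for Class 1 and Class 2 above
print(macro_precision(per_class))     # (0.5 + 0.9) / 2 = 0.7
print(micro_precision(per_class))     # 100 / 120 ≈ 0.83
```

Microaveraging is dominated by the large class (Class 2 here), while macroaveraging weights each class equally; that is why the two numbers differ.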