Transcript: NB, KNN, FS
Bayesian Classification, KNN, Feature Selection, Evaluation
陈翀
Ref: PengBo@NC&IS@PKU, WBIA lecture
Lecture notes for the Seminar on Statistical Machine Learning and Data Mining Techniques and Methods
Outline of This Lecture
Text Categorization
Problem definition
Naïve Bayes Classifier
K-Nearest Neighbor Classifier
Evaluation
Categorization
Definition
Given:
A description of an instance, x ∈ X, where X is the instance language or instance space.
Issue: how to represent text documents.
A fixed set of categories: C = {c1, c2, …, cn}
Determine:
The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
We want to know how to build categorization functions ("classifiers").
Text Categorization Examples
Assign labels to each document or web-page:
Labels are most often topics such as Yahoo-categories
e.g., "finance," "sports," "news>world>asia>business"
Labels may be genres
e.g., "editorials", "movie-reviews", "news"
Labels may be opinion
e.g., "like", "hate", "neutral"
Labels may be domain-specific binary
Classification Methods
Manual classification
Used by Yahoo!, Looksmart, about.com, ODP, Medline
Accurate but expensive to scale
Automatic document classification
Hand-coded rule-based systems
Spam mail filter, …
Supervised learning of a document-label assignment function
No free lunch: requires hand-classified training data
Note that many commercial systems use a combination of methods
Naïve Bayes
Bernoulli trial
A Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes, "success" and "failure".
Binomial Distribution
The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.
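For reference, the probability mass function of the binomial distribution (a standard result, stated here for completeness rather than taken from the slide):
$$P(X = k) = \binom{n}{k}\, p^{k} (1-p)^{n-k}, \qquad k = 0, 1, \dots, n$$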
Multinomial Distribution
The multinomial distribution is a generalization of the binomial distribution: each trial results in one of some fixed finite number k of possible outcomes, with probabilities p1, …, pk, and there are n independent trials.
We can use a random variable Xi to indicate the number of times outcome i was observed over the n trials.
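The corresponding probability mass function (standard form, added for completeness):
$$P(X_1 = x_1, \dots, X_k = x_k) = \frac{n!}{x_1! \cdots x_k!}\; p_1^{x_1} \cdots p_k^{x_k}, \qquad \sum_{i} x_i = n$$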
Bayes’ Rule
$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$
where P(h) is the prior, P(h | D) is the posterior, and P(D | h) is the likelihood.
Use Bayes Rule to Gamble
Someone draws an envelope at random and offers to sell it to you. How much should you pay?
Before deciding, you are allowed to see one bead drawn from the envelope. Suppose it's red: how much should you pay now?
Prosecutor's fallacy
You win the lottery jackpot. You are then charged with having cheated, for instance with having bribed lottery officials.
At the trial, the prosecutor points out that winning the lottery without cheating is extremely unlikely, and that therefore your being innocent must be extremely unlikely as well.
The fallacy: the small quantity is P(win | innocent), but the verdict should rest on P(innocent | win), which by Bayes' rule also depends on the prior probability of cheating.
Likelihood
A likelihood function is a conditional probability function considered as a function of its second argument with its first argument held fixed.
Given a parameterized family of probability density functions $x \mapsto f(x \mid \theta)$, where $\theta$ is the parameter, the likelihood function is
$$L(\theta \mid x) = f(x \mid \theta)$$
where x is the observed outcome of an experiment.
Maximum a posteriori Hypothesis
$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\, P(h)$$
since P(D) is constant across hypotheses.
Maximum likelihood Hypothesis
If all hypotheses are a priori equally likely, we only need to consider the P(D | h) term:
$$h_{ML} = \arg\max_{h \in H} P(D \mid h)$$
Bayes Classifiers
Task: classify a new instance D, described by a tuple of attribute values $D = \langle x_1, x_2, \dots, x_n \rangle$, into one of the classes $c_j \in C$.
$$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \dots, x_n) = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \dots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \dots, x_n)} = \arg\max_{c_j \in C} P(x_1, x_2, \dots, x_n \mid c_j)\, P(c_j)$$
Naïve Bayes Assumption
P(cj) can be estimated from the frequency of classes in the training examples.
P(x1, x2, …, xn | cj) has $O(|X|^n \cdot |C|)$ parameters, and could only be estimated if a very, very large number of training examples were available.
Naïve Bayes conditional independence assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj).
The Naïve Bayes Classifier
[Figure: class node Flu with feature nodes X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle-ache.]
Conditional independence assumption: features detect term presence and are independent of each other given the class:
$$P(X_1, \dots, X_5 \mid C) = P(X_1 \mid C)\, P(X_2 \mid C) \cdots P(X_5 \mid C)$$
This model is appropriate for binary variables: the multivariate binomial model.
Learning the Model
[Figure: class node C with feature nodes X1 … X6.]
First attempt: maximum likelihood estimates; simply use the frequencies in the data:
$$\hat{P}(c_j) = \frac{N(C = c_j)}{N} \qquad\qquad \hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i,\, C = c_j)}{N(C = c_j)}$$
Problem with Max Likelihood
[Figure: the same Flu model with features X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle-ache.]
$$P(X_1, \dots, X_5 \mid C) = P(X_1 \mid C)\, P(X_2 \mid C) \cdots P(X_5 \mid C)$$
What if we have seen no training cases where the patient had no flu but did have muscle aches?
$$\hat{P}(X_5 = t \mid C = nf) = \frac{N(X_5 = t,\, C = nf)}{N(C = nf)} = 0$$
Zero probabilities cannot be conditioned away, no matter the other evidence:
$$\arg\max_c \hat{P}(c) \prod_i \hat{P}(x_i \mid c)$$
Smoothing to Eliminate Zeros
$$\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i,\, C = c_j) + 1}{N(C = c_j) + k}$$
where k is the number of values of Xi.
Add-one (Laplace) smoothing acts as a uniform prior (each attribute value occurs once for each class) that is then updated as evidence from the training data comes in.
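A minimal sketch of this smoothed estimate in code, assuming the counts come from a list of (value, label) pairs for a single feature; the helper names are illustrative, not from the slides:

```python
from collections import Counter

def smoothed_conditional(examples, num_values):
    """Add-one (Laplace) smoothed estimate of P(X_i = x | C = c).

    examples: list of (feature_value, class_label) pairs for one feature X_i.
    num_values: k, the number of distinct values the feature can take.
    """
    class_counts = Counter(label for _, label in examples)   # N(C = c)
    joint_counts = Counter(examples)                          # N(X_i = x, C = c)

    def p(x, c):
        # every (value, class) pair starts with one pseudo-count
        return (joint_counts[(x, c)] + 1) / (class_counts[c] + num_values)

    return p

# an unseen (t, nf) combination now gets a small non-zero probability instead of 0
p = smoothed_conditional([("t", "flu"), ("f", "flu"), ("f", "nf")], num_values=2)
print(p("t", "nf"))  # 1/3
```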
Document Generative Model
“Love is patient, love is kind. “
Basic: bag of words, e.g. {love, is, patient, kind}
A binary independence model (multivariate binomial generation): feature Xi is a term; the value Xi = 1 or 0 indicates whether term Xi is present in the doc or not.
A multinomial unigram language model (multinomial generation): feature Xi is a term position; the value of Xi is the term at position i.
Bernoulli Naive Bayes Classifiers
Multivariate binomial model: one feature Xw for each word in the dictionary; Xw = true in document d if w appears in d.
Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears.
Multinomial Naive Bayes Classifiers II
Multinomial = class conditional unigram: one feature Xi for each word position in the document; the feature's values are all words in the dictionary; the value of Xi is the word in position i.
Naïve Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about words in other positions.
$$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j) = \arg\max_{c_j \in C} P(c_j)\, P(x_1 = \text{"our"} \mid c_j) \cdots P(x_n = \text{"text"} \mid c_j)$$
Multinomial Naive Bayes Classifiers
Still too many possibilities.
Second assumption: assume that classification is independent of the positions of the words, i.e. word appearance does not depend on position:
$$P(X_i = w \mid c) = P(X_j = w \mid c)$$
for all positions i, j, word w, and class c.
We then have just one multinomial feature predicting for all words, and use the same parameters for each position.
Parameter estimation
Binomial (Bernoulli) model: $\hat{P}(X_w = t \mid c_j)$ = fraction of documents of topic cj in which word w appears.
Multinomial model: $\hat{P}(X_i = w \mid c_j)$ = fraction of times in which word w appears, across all documents of topic cj.
Can create a mega-document for topic j by concatenating all documents in this topic.
Naive Bayes algorithm (Multinomial model)
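A rough sketch of the multinomial training and apply steps under the estimates above (function and variable names are illustrative, not from the slides):

```python
import math
from collections import Counter

def train_multinomial_nb(docs):
    """docs: list of (token_list, class_label). Returns vocab, priors, smoothed likelihoods."""
    classes = {label for _, label in docs}
    vocab = {w for tokens, _ in docs for w in tokens}
    prior, cond = {}, {}
    for c in classes:
        class_docs = [tokens for tokens, label in docs if label == c]
        prior[c] = len(class_docs) / len(docs)
        # "mega-document": concatenate all docs of class c and count term frequencies
        tf = Counter(w for tokens in class_docs for w in tokens)
        total = sum(tf.values())
        cond[c] = {w: (tf[w] + 1) / (total + len(vocab)) for w in vocab}  # add-one smoothing
    return vocab, prior, cond

def apply_multinomial_nb(vocab, prior, cond, tokens):
    """Pick the class maximizing log P(c) + sum over token positions of log P(w | c)."""
    scores = {c: math.log(prior[c]) + sum(math.log(cond[c][w]) for w in tokens if w in vocab)
              for c in prior}
    return max(scores, key=scores.get)
```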
Naive Bayes algorithm (Bernoulli model)
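Similarly, a sketch of the Bernoulli variant (again with illustrative names); the key difference is that every vocabulary term contributes to the score, whether present or absent:

```python
import math
from collections import Counter

def train_bernoulli_nb(docs):
    """docs: list of (set_of_terms, class_label). Smoothed estimates of P(w present | c)."""
    classes = {label for _, label in docs}
    vocab = {w for terms, _ in docs for w in terms}
    prior, cond = {}, {}
    for c in classes:
        class_docs = [terms for terms, label in docs if label == c]
        prior[c] = len(class_docs) / len(docs)
        df = Counter(w for terms in class_docs for w in set(terms))  # per-class document frequency
        # add-one smoothing over the two outcomes (present / absent)
        cond[c] = {w: (df[w] + 1) / (len(class_docs) + 2) for w in vocab}
    return vocab, prior, cond

def apply_bernoulli_nb(vocab, prior, cond, doc_terms):
    scores = {}
    for c in prior:
        s = math.log(prior[c])
        for w in vocab:
            p = cond[c][w]
            s += math.log(p) if w in doc_terms else math.log(1.0 - p)
        scores[c] = s
    return max(scores, key=scores.get)
```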
NB Example
[Worked example: classify test document 5, c(5) = ?; the training-data table and computed values are not shown.]
Multinomial NB classifier: feature likelihood estimates and posterior give the result c(5) = China.
Bernoulli NB classifier: feature likelihood estimates, posterior, classification.
Multinomial vs. multivariate binomial? The multinomial is in general better.
Feature Selection
Feature Selection: Why?
Text collections have a large number of features: 10,000 – 1,000,000 unique words … and more.
Feature selection may make using a particular classifier feasible: some classifiers can't deal with hundreds of thousands of features.
It reduces training time: training time for some methods is quadratic or worse in the number of features.
It can improve generalization.
Feature selection: how?
Two ideas:
Hypothesis-testing statistics: are we confident that the value of one categorical variable is associated with the value of another? Chi-square test.
Information theory: how much information does the value of one categorical variable give you about the value of another? Mutual information.
They're similar, but χ² measures confidence in association (based on available statistics), while MI measures the extent of association (assuming perfect knowledge of probabilities).
Pearson's chi-square test
Example: is a six-sided die "fair"?
Null hypothesis: the six-sided die is fair, so there is an equal theoretical frequency for every possible outcome.
Chi-square is calculated by
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
where Oi is the observed frequency and Ei is the theoretical (expected) frequency.
χ² statistic (CHI)
Observed counts fo, with expected counts fe in parentheses, for term = "jaguar" and class = "auto":

                     Class = auto    Class ≠ auto
  Term = jaguar      2 (0.25)        3 (4.75)
  Term ≠ jaguar      500 (502)       9500 (9498)

$$\chi^2(j, a) = \sum \frac{(O - E)^2}{E} = \frac{(2 - 0.25)^2}{0.25} + \frac{(3 - 4.75)^2}{4.75} + \frac{(500 - 502)^2}{502} + \frac{(9500 - 9498)^2}{9498} \approx 12.9 \quad (p < 0.001)$$

Null hypothesis: the events "term occurs" and "class = auto" are independent.
The null hypothesis is rejected with high confidence (p < 0.001).
Example
Consider the class poultry and the term export. The counts of the number of documents with the four possible combinations of term occurrence and class membership form the 2×2 contingency table.
χ² statistic (CHI)
There is a simpler formula for the 2×2 χ². With
A = #(t, c), B = #(t, ¬c), C = #(¬t, c), D = #(¬t, ¬c), and N = A + B + C + D:
$$\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$
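A small sketch of this 2×2 score as code (the helper name is illustrative; A, B, C, D are document counts over the training set as defined above):

```python
def chi_square(A, B, C, D):
    """2x2 chi-square score for a (term, class) pair.

    A = docs containing the term and in the class
    B = docs containing the term, not in the class
    C = docs without the term, in the class
    D = docs without the term, not in the class
    """
    N = A + B + C + D
    num = N * (A * D - C * B) ** 2
    den = (A + C) * (B + D) * (A + B) * (C + D)
    return num / den if den else 0.0

# jaguar/auto example: ~12.8 (the slide's ~12.9 uses rounded expected counts)
print(chi_square(A=2, B=3, C=500, D=9500))
```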
Feature selection via Mutual Information
In the training set, choose the k words which best discriminate (give most information on) the categories.
MI measures how much information the presence/absence of a term contributes to making the correct classification decision on c.
The mutual information between a word (presence indicator $e_w \in \{0,1\}$) and a class (membership indicator $e_c \in \{0,1\}$) is:
$$I(w, c) = \sum_{e_w \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(e_w, e_c)\, \log \frac{P(e_w, e_c)}{P(e_w)\, P(e_c)}$$
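A sketch of this quantity estimated from the same A, B, C, D document counts used for the χ² score (mapping the counts to the joint probabilities this way is my assumption; names are illustrative):

```python
import math

def mutual_information(A, B, C, D):
    """MI between term presence and class membership, estimated from 2x2 document counts.

    A = #(term present, in class),  B = #(present, not in class),
    C = #(absent, in class),        D = #(absent, not in class).
    """
    N = A + B + C + D
    mi = 0.0
    for n_joint, n_term, n_class in [
        (A, A + B, A + C),   # term present, in class
        (B, A + B, B + D),   # term present, not in class
        (C, C + D, A + C),   # term absent, in class
        (D, C + D, B + D),   # term absent, not in class
    ]:
        if n_joint:
            p_joint = n_joint / N
            mi += p_joint * math.log2(p_joint / ((n_term / N) * (n_class / N)))
    return mi
```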
Example
Feature selection via MI (contd.)
For each category we build a list of k most
discriminating terms.
For example (on 20 Newsgroups):
sci.electronics:
circuit, voltage, amp, ground,
copy, battery, electronics, cooling, …
rec.autos: car, cars, engine, ford, dealer,
mustang, oil, collision, autos, tires, toyota, …
Greedy: does not account for correlations
between terms
Why?
Feature Selection
Mutual Information: clear information-theoretic interpretation.
Chi-square: statistical foundation; may select more rare terms than MI.
Just use the commonest terms? No particular foundation, but in practice this is often 90% as good.
Feature selection for NB
In general, feature selection is necessary for the binomial (Bernoulli) NB model; otherwise you suffer from noise and multi-counting.
"Feature selection" really means something different for multinomial NB: it means dictionary truncation. The multinomial NB model only has 1 feature.
K-Nearest Neighbors
Recall: Vector Space Representation
Each document is a vector, one
component for each term (= word).
Normalize to unit length.
High-dimensional vector space:
Terms
are axes
10,000+ dimensions, or even 100,000+
Docs are vectors in this space
Classes in a Vector Space
[Figure: documents plotted in a 2-D projection of the vector space, grouped into the classes Government, Science, and Arts.]
Classification Using Vector Spaces
Each training doc a point (vector) labeled
by its topic (= class)
Hypothesis: docs of the same class form
a contiguous region of space
We define surfaces to delineate classes in
space
Test Document = Government
Is the similarity hypothesis true in general?
[Figure: the same Government / Science / Arts regions, with a test document falling in the Government region.]
k Nearest Neighbor Classification
To classify document d into class c
Define k-neighborhood N as k nearest
neighbors of d
Count the number of documents i in N that belong to c
Estimate P(c|d) as i/k
Choose as class argmax_c P(c|d) [= majority class]
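A minimal sketch of this procedure over unit-length tf-idf vectors; for normalized vectors, cosine similarity is just the dot product (names here are illustrative):

```python
from collections import Counter

def cosine(u, v):
    """Dot product of sparse vectors stored as {term: weight} dicts.
    Equals cosine similarity when both vectors have unit length."""
    if len(v) < len(u):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def knn_classify(train, test_vec, k=3):
    """train: list of (vector, label); return the majority label among the k nearest neighbors."""
    neighbors = sorted(train, key=lambda ex: cosine(ex[0], test_vec), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```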
Example: k=6 (6NN)
P(science | test doc) = ?
[Figure: the test document with its 6 nearest neighbors among the Government / Science / Arts points.]
Nearest-Neighbor Learning Algorithm
Learning is just storing the representations
of the training examples in D.
Testing instance x:
Compute
similarity between x and all
examples in D.
Assign x the category of the most similar
example in D.
Does not explicitly compute a
generalization or category prototypes.
Also called:
Case-based
learning
Memory-based learning
Why K?
Using only the closest example to
determine the categorization is subject to
errors due to:
A single
atypical example.
Noise (i.e. error) in the category label of a
single training example.
More robust alternative is to find the k
most-similar examples and return the
majority category of these k examples.
Value of k is typically odd to avoid ties; 3
and 5 are most common.
kNN decision boundaries
Boundaries are in principle arbitrary surfaces, but usually polyhedra.
[Figure: piecewise-linear kNN decision boundaries between the Government, Science, and Arts regions.]
Similarity Metrics
Nearest neighbor method depends on a
similarity (or distance) metric.
Simplest
for continuous m-dimensional
instance space is Euclidean distance.
Simplest for m-dimensional binary instance
space is Hamming distance (number of
feature values that differ).
For text, cosine similarity of tf.idf weighted
vectors is typically most effective.
Illustration of 3 Nearest Neighbor for
Text Vector Space
Nearest Neighbor with Inverted Index
Naively finding nearest neighbors requires
a linear search through |D| documents in
collection
But determining k nearest neighbors is the
same as determining the k best retrievals
using the test document as a query to a
database of training documents.
Use standard vector space inverted index
methods to find the k nearest neighbors.
Testing Time: O(B|Vt|)
where B is the
average number of training documents in which a
test-document word appears.
kNN: Discussion
No feature selection necessary
Scales well with large number of classes
Don’t
need to train n classifiers for n classes
Classes can influence each other
Small
changes to one class can have ripple
effect
Scores can be hard to convert to
probabilities
No training necessary
Bias vs. variance:
Choosing the correct model
capacity
kNN vs. Naive Bayes
Bias/Variance tradeoff: variance ≈ capacity
kNN has high variance and low bias (infinite memory).
NB has low variance and high bias (the decision surface has to be linear, a hyperplane).
Summary
Categorization: training data, over-fitting & generalization
Naïve Bayes: Bayesian methods, $c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j)$
Bernoulli NB classifier
Multinomial NB classifier
K-Nearest Neighbor: bias vs. variance
Feature selection: chi-square test $\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$, Mutual Information
Readings
[1] IIR Ch13, Ch14.2
[2] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), 1999.
Classification
Evaluation
Evaluation: Classic Reuters Data Set
Most (over)used data set
21578 documents
9603 training, 3299 test articles
(ModApte split)
118 categories
An article can be in more than one category
Learn 118 binary category distinctions
Average document: about 90 types, 200 tokens
Average number of classes assigned
Common categories (#train, #test):
  Earn (2877, 1087)          Trade (369, 119)
  Acquisitions (1650, 179)   Interest (347, 131)
  Money-fx (538, 179)        Ship (197, 89)
  Grain (433, 149)           Wheat (212, 71)
  Crude (389, 189)           Corn (182, 56)
Reuters Text Categorization data set
(Reuters-21578) document
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off
tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states
determining industry positions on a number of issues, according to the National Pork Producers
Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues,
including the future direction of farm policy and the tax law as it applies to the agriculture sector.
The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus)
control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all
areas of the industry, the NPPC added. Reuter
</BODY></TEXT></REUTERS>
New Reuters: RCV1: 810,000 docs
Top topics in Reuters RCV1
PKU Tianwang (北大天网): Chinese Web Page Classification
By mobilizing dozens of students from different majors, a large-scale, hierarchically organized sample collection of Chinese web pages was built by manual selection.
It includes 12,336 training web-page instances and 3,269 test web-page instances, distributed over 12 top-level classes and 733 categories in total; each category has on average 17 training instances and 4.5 test instances.
Tianwang provides the web-page sample collection free of charge to interested peers; Yanqiong product number: YQ-WEBENCH-V0.8.
Chinese Web Information Retrieval Forum: www.cwirf.org.
Classification evaluations are run at SEWM, the national workshop on search engines and web information mining.
PKU Tianwang (北大天网): Chinese Web Page Classification

  ID  Category                               #Categories  #Train  #Test
  1   Humanities & Arts (人文与艺术)          24           419     110
  2   News & Media (新闻与媒体)               7            125     19
  3   Business & Economy (商业与经济)         48           839     214
  4   Entertainment & Leisure (娱乐与休闲)    88           1510    374
  5   Computers & Internet (计算机与因特网)   58           925     238
  6   Education (教育)                        18           286     85
  7   Regional (各国风情)                     53           891     235
  8   Natural Science (自然科学)              113          1892    514
  9   Government & Politics (政府与政治)      18           288     84
  10  Social Science (社会科学)               104          1765    479
  11  Health & Medicine (医疗与健康)          136          2295    616
  12  Society & Culture (社会与文化)          66           1101    301
      Total (共计)                            733          12336   3269
Measuring Classification Figures of Merit
Accuracy of classification
Main
evaluation criterion in academia
More in a moment
Speed of training statistical classifier
Some
methods are very cheap; some very
costly
Speed of classification (docs/hour)
No
big differences for most algorithms
Exceptions: kNN, complex preprocessing
requirements
Effort in creating training set/hand-built
classifier
Measuring Classification Figures of Merit
In the real world, economic measures. Your choices are:
Do no classification: that has a cost (hard to compute)
Do it all manually: has an easy-to-compute cost if doing it like that now
Do it all with an automatic classifier: mistakes have a cost
Do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases
Commonly the last method is most cost-efficient and is adopted
Concept Drift
Categories change over time
Example: “president of the united states”
1999:
clinton is great feature
2002: clinton is bad feature
One measure of a text classification
system is how well it protects against
concept drift.
Feature selection: can be bad in protecting
against concept drift
Per class evaluation measures
Consider the confusion matrix c, where c_ij is the number of documents whose actual class is i and whose predicted class is j (rows and columns range over the classes A, B, C).
Recall: fraction of docs in class i classified correctly: $\dfrac{c_{ii}}{\sum_j c_{ij}}$
Precision: fraction of docs assigned class i that are actually about class i: $\dfrac{c_{ii}}{\sum_j c_{ji}}$
"Correct rate" (1 − error rate): fraction of docs classified correctly: $\dfrac{\sum_i c_{ii}}{\sum_i \sum_j c_{ij}}$
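A small sketch of these measures over a confusion matrix stored as a nested dict, following the c[actual][predicted] convention above (names are illustrative):

```python
def per_class_measures(c, classes):
    """c[i][j] = number of docs with actual class i assigned class j."""
    measures = {}
    for i in classes:
        tp = c[i][i]
        row = sum(c[i][j] for j in classes)   # all docs actually in class i
        col = sum(c[j][i] for j in classes)   # all docs assigned class i
        measures[i] = {
            "recall": tp / row if row else 0.0,
            "precision": tp / col if col else 0.0,
        }
    total = sum(c[i][j] for i in classes for j in classes)
    correct = sum(c[i][i] for i in classes)
    return measures, correct / total          # per-class P/R and overall correct rate
```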
Measures of Accuracy
Evaluation must be done on test data that
is independent of the training data
Overall error rate
Not a good measure for small classes. Why?
Precision/recall for classification decisions
F1 measure: 1/F1 = ½ (1/P + 1/R), i.e. F1 = 2PR/(P + R)
Correct estimate of size of category
Why is this different?
Stability over time / category drift
Utility: costs of false positives / false negatives may differ
Good practice department:N-Fold
Cross-Validation
Results can vary based on sampling error due to
different training and test sets.
Average results over multiple training and test sets (splits
of the overall data) for the best results.
Ideally, test and training sets are independent on each
trial.
But
this would require too much labeled data.
Partition data into N equal-sized disjoint segments.
Run N trials, each time using a different segment of the data for testing, and training on the remaining N−1 segments.
This way, at least test-sets are independent.
Report average classification accuracy over the N trials.
Typically, N = 10.
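A minimal sketch of the procedure (illustrative names; `train` and `evaluate` stand in for whatever classifier's fit and scoring steps are being tested):

```python
import random

def n_fold_cross_validation(data, train, evaluate, n=10, seed=0):
    """Partition data into n disjoint folds; average accuracy over n train/test trials."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]    # n roughly equal-sized segments
    accuracies = []
    for i in range(n):
        test = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        accuracies.append(evaluate(model, test))
    return sum(accuracies) / n
```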
Good practice department: Learning Curves
In practice, labeled data is usually rare
and expensive.
Would like to know how performance
varies with the number of training
instances.
Learning curves plot classification
accuracy on independent test data (Y axis)
versus number of training examples (X
axis).
One can do both the above and produce
learning curves averaged over multiple
trials from cross-validation
how to combine multiple measures?
If we have more than one class, how do
we combine multiple performance
measures into one quantity?
Macroaveraging:
Compute performance for each class, then
average.
Microaveraging:
Collect decisions for all classes, compute
contingency table, evaluate.
Micro- vs. Macro-Averaging: Example

  Class 1              Truth: yes   Truth: no
  Classifier: yes      10           10
  Classifier: no       10           970

  Class 2              Truth: yes   Truth: no
  Classifier: yes      90           10
  Classifier: no       10           890

  Micro-Av. Table      Truth: yes   Truth: no
  Classifier: yes      100          20
  Classifier: no       20           1860

Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 ≈ 0.83
Why this difference?
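A short sketch reproducing these numbers from the two per-class tables (counts taken from the example above; helper names are illustrative):

```python
def precision(tp, fp):
    return tp / (tp + fp)

# (tp, fp) per class = (classifier-yes & truth-yes, classifier-yes & truth-no)
per_class = [(10, 10), (90, 10)]

macro = sum(precision(tp, fp) for tp, fp in per_class) / len(per_class)
micro = precision(sum(tp for tp, _ in per_class), sum(fp for _, fp in per_class))

print(macro)  # 0.7   -- each class weighted equally
print(micro)  # ~0.83 -- dominated by the larger class (Class 2)
```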