Lecture 3


Seminar of Interest
Friday, September 15, at 11:00 am, EMS W220.
Dr. Hien Nguyen of the University of Wisconsin-Whitewater.
"Hybrid User Model for Information Retrieval: Framework and Evaluation".
Overview of Today’s Lecture
• Last Time: representing examples (feature selection), HW0, intro to supervised learning
• HW0 due on Tuesday
• Today: K-NN wrap-up, Naïve Bayes
• Reading Assignment: Sections 2.1, 2.2, Chapter 5
Nearest-Neighbor Algorithms
(a.k.a. exemplar models, instance-based learning (IBL), case-based learning)
• Learning ≈ memorize the training examples
• Problem solving = find the most similar example in memory; output its category
[Figure: “Voronoi Diagrams” (pg 233), showing ‘+’ and ‘−’ training examples and a query point ‘?’.]
Sample Experimental Results

Testset correctness:

Testbed             IBL    D-Trees   Neural Nets
Wisconsin Cancer    98%    95%       96%
Heart Disease       78%    76%       ?
Tumor               37%    38%       ?
Appendicitis        83%    85%       86%

Simple algorithm works quite well!
Simple Example – 1-NN
(1-NN ≡ one nearest neighbor)

Training Set
1. a=0, b=0, c=1   +
2. a=0, b=0, c=0   −
3. a=1, b=1, c=1   −

Test Example
• a=0, b=1, c=0   ?

“Hamming Distance” to the test example:
• Ex 1 = 2
• Ex 2 = 1
• Ex 3 = 2
So output −
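A minimal Python sketch of 1-NN with Hamming distance on the training set above (function and variable names are mine, not from the lecture):

```python
def hamming(x, y):
    """Count the features on which two examples disagree."""
    return sum(xi != yi for xi, yi in zip(x, y))

def one_nn(train, test_x):
    """train: list of (feature_tuple, label); return the label of the nearest example."""
    _, nearest_label = min(train, key=lambda ex: hamming(ex[0], test_x))
    return nearest_label

# The slide's training set: features (a, b, c) and class labels.
train = [((0, 0, 1), '+'), ((0, 0, 0), '-'), ((1, 1, 1), '-')]
print(one_nn(train, (0, 1, 0)))   # Ex 2 is nearest (distance 1), so this prints '-'
```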
K-NN Algorithm
Collect K nearest neighbors, select majority
classification (or somehow combine their classes)
• What should K be?
• Problem dependent
• Can use tuning sets (later) to select a
good setting for K
[Plot: error rate on the tuning set as a function of K, for K = 1, 2, 3, 4, 5.]
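A sketch of selecting K with a tuning set, reusing hamming() from the 1-NN sketch above; the candidate K values and the majority-vote tie-breaking are illustrative choices, not prescribed by the lecture:

```python
def knn_predict(train, x, k):
    """Majority vote among the k nearest training examples (Hamming distance)."""
    neighbors = sorted(train, key=lambda ex: hamming(ex[0], x))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

def choose_k(train, tune, candidate_ks=(1, 2, 3, 4, 5)):
    """Return the K with the lowest error rate on the tuning set."""
    def error_rate(k):
        wrong = sum(knn_predict(train, x, k) != y for x, y in tune)
        return wrong / len(tune)
    return min(candidate_ks, key=error_rate)
```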
What is the “distance” between two examples?

One possibility: sum the (weighted) per-feature distances:

d(e_1, e_2) = \sum_{i=1}^{\#\text{features}} w_i \cdot d_i(e_{1,i}, e_{2,i})

where w_i is a feature-specific (numeric) weight and d_i(e_{1,i}, e_{2,i}) is the distance between examples 1 and 2 on feature i only.
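A sketch of this weighted distance; the particular per-feature distance (absolute difference for numbers, 0/1 mismatch otherwise) is an illustrative choice, not specified by the slide:

```python
def feature_distance(v1, v2):
    """Distance on a single feature: |difference| for numbers, 0/1 mismatch otherwise."""
    if isinstance(v1, (int, float)) and isinstance(v2, (int, float)):
        return abs(v1 - v2)
    return 0 if v1 == v2 else 1

def weighted_distance(e1, e2, weights):
    """d(e1, e2) = sum over features i of w_i * d_i(e1_i, e2_i)."""
    return sum(w * feature_distance(v1, v2)
               for w, v1, v2 in zip(weights, e1, e2))
```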
Using K neighbors to classify an example

Given: nearest neighbors e_1, ..., e_k with output categories O_1, ..., O_k.

The output for example e_t is

O_t = \arg\max_{c \,\in\, \text{possible categories}} \sum_{i=1}^{k} K(e_i, e_t) \cdot \delta(O_i, c)

where K(e_i, e_t) is the kernel and \delta(O_i, c) is the “delta” function (= 1 if O_i = c, else = 0).
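A sketch of this voting rule; `kernel` is any function K(e_i, e_t) returning a weight (examples of kernels follow on the next slides), and the names are mine:

```python
def weighted_vote(neighbors, test_x, kernel):
    """neighbors: list of (example, category) pairs; return the arg-max category."""
    scores = {}
    for e_i, o_i in neighbors:
        # delta(O_i, c) simply routes each neighbor's kernel weight to its own category
        scores[o_i] = scores.get(o_i, 0.0) + kernel(e_i, test_x)
    return max(scores, key=scores.get)
```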
Kernel Functions
• Term “kernel” comes from statistics
• Major topic for support vector machines
(later)
• Weights the interaction between pairs of examples
• Can involve a similarity measure
Kernel Function K(e_i, e_t): Examples

Suppose the example ‘?’ has three neighbors, two of which are ‘−’ and one of which is ‘+’.

• If K(e_i, e_t) = 1: simple majority vote (‘?’ is classified as −).
• If K(e_i, e_t) = 1 / dist(e_i, e_t): inverse-distance weighting (‘?’ could be classified as +).
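The two kernels above, written so they can be passed to the weighted_vote() sketch from the previous slide; the small epsilon that guards against a zero distance is my addition:

```python
def uniform_kernel(e_i, e_t):
    """K(e_i, e_t) = 1: every neighbor counts equally (simple majority vote)."""
    return 1.0

def inverse_distance_kernel(dist):
    """Build K(e_i, e_t) = 1 / dist(e_i, e_t) from any distance function."""
    return lambda e_i, e_t: 1.0 / (dist(e_i, e_t) + 1e-9)   # epsilon avoids division by zero
```

With the uniform kernel the lone nearby ‘+’ is outvoted by the two ‘−’ neighbors; with inverse-distance weighting, a very close ‘+’ can outweigh them.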
Gaussian Kernel: popular in SVMs

K(e_i, e_t) = \exp\!\left( -\,\frac{\lVert e_i - e_t \rVert^2}{2\sigma^2} \right)

where \lVert e_i - e_t \rVert is the distance between the two examples, e is Euler’s constant, and \sigma is the “standard deviation”.

Compare the shapes of alternative weighting curves: y = 1/x, y = 1/x², y = 1/exp(x²).
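A sketch of the Gaussian kernel for numeric feature vectors; `sigma` is the “standard deviation” parameter from the slide:

```python
import math

def gaussian_kernel(e_i, e_t, sigma=1.0):
    """K(e_i, e_t) = exp(-||e_i - e_t||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(e_i, e_t))   # squared Euclidean distance
    return math.exp(-sq_dist / (2 * sigma ** 2))
```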
Instance-Based Learning
(IBL) and Efficiency
• IBL algorithms postpone work from
training to testing
• Pure NN/IBL just memorizes the training
data
• Computationally intensive
• Match all features of all training examples
Instance-Based Learning
(IBL) and Efficiency
• Possible Speed-ups
• Use a subset of the training examples
(Aha)
• Use clever data structures (A. Moore): KD trees, hash tables, Voronoi diagrams (see the KD-tree sketch after this list)
• Use a subset of the features
• Feature selection
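As one concrete example of the clever-data-structures idea, a KD tree answers nearest-neighbor queries without scanning every training example. The snippet below uses scikit-learn's KDTree purely as an illustration; the library choice and the random data are my assumptions, not from the lecture:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((10_000, 5))          # 10,000 training examples, 5 numeric features
tree = KDTree(X)                     # built once, at "training" time
dist, idx = tree.query(rng.random((1, 5)), k=3)   # 3 nearest neighbors of one query point
```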
Feature Selection as
Search Problem
• State = set of features
• Start state:
  • No features (forward selection), or
  • All features (backward selection)
• Operators = add/subtract features
• Scoring function = accuracy on the tuning set
Forward and Backward Selection of Features
• Hill-climbing (“greedy”) search

Forward: start from the empty feature set {} (50%) and add one feature at a time, e.g. {F1} 62%, ..., {FN} 71%.

Backward: start from all features {F1, F2, ..., FN} (73%) and subtract one feature at a time, e.g. subtract F1 → {F2, ..., FN} 79%, subtract F2 → ...

Each percentage is that feature set’s accuracy on the tuning set (our heuristic function).
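A sketch of greedy forward selection; `score(feature_set)` is assumed to train a model using only those features and return its accuracy on the tuning set (the heuristic function above):

```python
def forward_selection(all_features, score):
    """Hill-climb from the empty set, adding the single best feature each round."""
    chosen, best_score = set(), score(set())
    improved = True
    while improved:
        improved = False
        for f in all_features - chosen:
            s = score(chosen | {f})
            if s > best_score:               # best single addition found so far this round
                best_feature, best_score, improved = f, s, True
        if improved:
            chosen.add(best_feature)
    return chosen
```

Backward selection is the mirror image: start from all_features and try removing one feature at a time.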
Forward vs. Backward Feature Selection

Forward
• Faster in early steps because fewer features to test
• Fast for choosing a small subset of the features
• Misses useful features whose usefulness requires other features (feature synergy)

Backward
• Fast for choosing all but a small subset of the features
• Preserves useful features whose usefulness requires other features
• Example: area is important, but the available features are length and width
Feature Selection and Machine Learning

Filtering-Based Feature Selection:
all features → FS algorithm → subset of features → ML algorithm → model

Wrapper-Based Feature Selection:
all features → FS algorithm → model
(the FS algorithm calls the ML algorithm many times and uses it to help select features)
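A brief sketch of the contrast; the per-feature scoring function used by the filter is an illustrative assumption (the slide does not specify one):

```python
def filter_select(features, feature_score, top_n):
    """Filter: score each feature on its own, never calling the ML algorithm."""
    return sorted(features, key=feature_score, reverse=True)[:top_n]

# Wrapper: the selector repeatedly calls the ML algorithm to score whole feature
# subsets, e.g. the forward_selection() sketch above with score() = tuning-set accuracy.
```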
Number of Features and Performance
• Too many features can hurt test-set performance
• Too many irrelevant features create many opportunities for an ML algorithm to detect spurious correlations
“Vanilla” K-NN Report Card

Learning Efficiency         A+
Classification Efficiency   F
Stability                   C
Robustness (to noise)       D
Empirical Performance       C
Domain Insight              F
Implementation Ease         A
Incremental Ease            A

But it is a good baseline.
K-NN Summary
• K-NN can be an effective ML algorithm
• Especially if few irrelevant features
• Good baseline for experiments
A Different Approach to Classification: Probabilistic Models
• Indicate confidence in the classification
• Given a feature vector: F = (f_1 = v_1, ..., f_n = v_n)
• Output a probability: P(class = + | F),
  the probability that the class is positive given the feature vector
Probabilistic K-NN
• Output a probability using the k neighbors
• Possible algorithm:
  P(class = + | F) = (number of “+” neighbors) / k
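A sketch of this estimate, usable with any distance function `dist` (such as the Hamming or weighted distances above); the names are mine:

```python
def prob_positive(train, x, k, dist):
    """Estimate P(class = + | F) as the fraction of the k nearest neighbors labeled '+'."""
    neighbors = sorted(train, key=lambda ex: dist(ex[0], x))[:k]
    return sum(1 for _, label in neighbors if label == '+') / k
```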
Bayes’ Rule
• Definitions:
  P(A \wedge B) \equiv P(B) \cdot P(A \mid B)
  P(A \wedge B) \equiv P(A) \cdot P(B \mid A)
• So (assuming P(B) > 0):
  P(B) \cdot P(A \mid B) = P(A) \cdot P(B \mid A)
  P(A \mid B) = \frac{P(A) \cdot P(B \mid A)}{P(B)}    (Bayes’ rule)

[Figure: Venn diagram of events A and B.]
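A quick numeric check of the rule with a made-up joint distribution; the numbers are illustrative only:

```python
p_A_and_B, p_A, p_B = 0.08, 0.10, 0.40     # illustrative probabilities, not real data

p_A_given_B = p_A_and_B / p_B              # 0.20
p_B_given_A = p_A_and_B / p_A              # 0.80
# Bayes' rule recovers P(A|B) from P(B|A), P(A), and P(B):
assert abs(p_A_given_B - p_A * p_B_given_A / p_B) < 1e-12
```

These numbers also preview the next point: P(A|B) = 0.20 is small while P(B|A) = 0.80 is large.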
Conditional Probabilities
• Note the difference for the events A and B pictured above:
  • P(A|B) is small
  • P(B|A) is large
Bayes’ Rule Applied to ML

P(\text{class} \mid F) = \frac{P(F \mid \text{class}) \cdot P(\text{class})}{P(F)}

where P(class | F) is shorthand for P(class = c \mid f_1 = v_1, \ldots, f_n = v_n).
Why do we care about Bayes’ rule? Because while P(class | F) is typically difficult to measure directly, the values on the RHS are often easy to estimate (especially if we make simplifying assumptions).
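A sketch of why the RHS is easy to estimate: both P(class) and per-feature conditionals can be read off the training data by counting. Estimating P(F | class) feature-by-feature already relies on the "simplifying assumption" the slide alludes to (features conditionally independent given the class, i.e. Naïve Bayes); that framing is my anticipation of the rest of the lecture, not something this slide states:

```python
from collections import Counter

def estimate(train):
    """train: list of (feature_tuple, label). Return P(class) and P(f_i = v | class)."""
    class_counts = Counter(label for _, label in train)
    prior = {c: n / len(train) for c, n in class_counts.items()}

    value_counts = Counter()                      # (class, feature index, value) -> count
    for feats, label in train:
        for i, v in enumerate(feats):
            value_counts[(label, i, v)] += 1
    likelihood = {key: n / class_counts[key[0]] for key, n in value_counts.items()}
    return prior, likelihood
```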