Transcript kNN

kNN: A Non-parametric Classification
and Prediction Technique
Goals of this set of transparencies:
1. Introduce kNN, a popular non-parametric technique
2. Illustrate the differences between parametric
and non-parametric techniques
Later:
1. Non-Parametric Density Estimation Techniques
2. Editing and Condensing Techniques to Enhance kNN
Classification and Decision Boundaries

Classification can be viewed as “learning good
decision boundaries” that separate the examples
belonging to different classes in a data set.
[Figure: example data set with a decision boundary separating the two classes]
Problems with Parametric Techniques



Parametric approaches assume that the type of model is known beforehand,
which is not realistic for many applications.
The model types used by parametric approaches are "kind of simplistic": if the
characteristics of the data do not match the assumptions of the underlying
model, unreliable predictions are obtained.
Non-parametric approaches rely on the following key ideas:
– "Let the data speak for themselves"
– "Predict new cases based on similar cases"
– "Use multiple local models instead of a single global model"
Instance-Based Classifiers
• Store the training records
• Use the training records to predict the class label of unseen cases
[Figure: a set of stored cases, each with attributes Atr1 … AtrN and a class label (A, B, C), and an unseen case with the same attributes whose class is to be predicted]
Instance-Based Classifiers

Instance-based classifiers do not create a model
but use the training examples directly to classify
unseen examples ("lazy" classifiers).
Examples:
– Rote-learner: memorizes the entire training data and performs
classification only if the attributes of a record match one of
the training examples exactly
– Nearest neighbor: uses the k "closest" points (nearest neighbors)
for performing classification

kNN— k Nearest-Neighbor Classifiers
[Figure: an unknown record plotted among the labeled training records]

kNN requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
– Compute its distance to all training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the
class label of the unknown record (e.g., by taking a majority
vote), as in the sketch below
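A minimal sketch of this procedure in Python (the names, such as knn_classify, are illustrative and not from the slides): it computes the Euclidean distance from the unknown record to every stored record, keeps the k closest, and takes a majority vote over their class labels.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(stored, unknown, k=3):
    """Classify 'unknown' by a majority vote over its k nearest stored records.

    stored:  list of (attribute_vector, class_label) pairs
    unknown: attribute vector of the record to classify
    """
    # 1. Compute the distance from the unknown record to every stored record.
    distances = [(euclidean(attrs, unknown), label) for attrs, label in stored]
    # 2. Identify the k nearest neighbors.
    neighbors = sorted(distances, key=lambda item: item[0])[:k]
    # 3. Take the majority vote of their class labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Tiny made-up data set: two numeric attributes, classes 'A' and 'B'.
training = [([1.0, 1.0], 'A'), ([1.2, 0.8], 'A'),
            ([5.0, 5.0], 'B'), ([5.5, 4.5], 'B')]
print(knn_classify(training, [1.1, 1.0], k=3))   # -> 'A'
```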
Definition of Nearest Neighbor
[Figure: (a) the 1-nearest neighbor, (b) the 2-nearest neighbors, and (c) the 3-nearest neighbors of a query point X]
The k nearest neighbors of a record x are the data points
that have the k smallest distances to x.
Voronoi Diagrams for NN-Classifiers
Each cell contains one sample, and every location within the cell is closer to
that sample than to any other sample. A Voronoi diagram divides the space into
such cells.
Every query point is assigned the classification of the sample within its cell.
The decision boundary separates the class regions based on the 1-NN decision
rule; knowledge of this boundary is sufficient to classify new points.
Remarks: Voronoi diagrams can be computed in lower-dimensional spaces but are
infeasible for higher-dimensional spaces. They also represent models for clusters
that have been generated by representative-based clustering algorithms.
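For low-dimensional data, the Voronoi cells of the stored samples can be computed directly; a minimal sketch, assuming SciPy is available (the sample coordinates are made up for illustration):

```python
import numpy as np
from scipy.spatial import Voronoi

# Four stored 2-D samples; each Voronoi cell is the region of the plane that is
# closer to its sample than to any other sample (the 1-NN region of that sample).
samples = np.array([[1.0, 1.0], [2.0, 4.0], [4.0, 2.0], [5.0, 5.0]])
vor = Voronoi(samples)

print(vor.vertices)      # corner points of the Voronoi cells
print(vor.ridge_points)  # pairs of samples whose cells share an edge; those
                         # edges form the 1-NN decision boundary wherever the
                         # two samples carry different class labels
```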
Nearest Neighbor Classification

Compute distance between two points:
– Euclidean distance
d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}
Determine the class from the nearest-neighbor list
– Take the majority vote of the class labels among the
k nearest neighbors
– Or weigh each vote according to distance,
e.g., with weight factor w = 1/d² (see the sketch below)
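A minimal sketch of the distance-weighted variant (the names are illustrative): instead of counting each of the k neighbors equally, every neighbor's vote is weighted by 1/d², so closer neighbors influence the decision more strongly.

```python
import math
from collections import defaultdict

def weighted_knn_classify(stored, unknown, k=3):
    """Distance-weighted kNN: each of the k nearest neighbors votes with w = 1/d^2."""
    dist = lambda p, q: math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    neighbors = sorted(((dist(attrs, unknown), label) for attrs, label in stored),
                       key=lambda item: item[0])[:k]
    votes = defaultdict(float)
    for d, label in neighbors:
        # Guard against d == 0 (the unknown record coincides with a stored one).
        votes[label] += 1.0 / (d ** 2) if d > 0 else float('inf')
    return max(votes, key=votes.get)

training = [([1.0, 1.0], 'A'), ([1.2, 0.8], 'A'),
            ([5.0, 5.0], 'B'), ([5.5, 4.5], 'B')]
print(weighted_knn_classify(training, [3.0, 3.0], k=3))   # -> 'A'
```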
Nearest Neighbor Classification…

Choosing the value of k (see the sketch below):
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from
other classes
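One common way to pick k in practice (not shown on the slides) is cross-validation over a small set of candidate values; a minimal sketch, assuming scikit-learn and its bundled iris data set are available:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Estimate accuracy for several values of k with 5-fold cross-validation;
# odd k avoids ties when taking a two-class majority vote.
for k in (1, 3, 5, 7, 9, 15, 25):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k:2d}  mean accuracy={scores.mean():.3f}")
```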
k-NN: More Complex Decision Boundaries
Nearest Neighbor Classification…

Scaling issues
– Attributes may have to be scaled to prevent the distance measure
from being dominated by one of the attributes (see the sketch below)
– Example:
  · height of a person may vary from 1.5 m to 1.8 m
  · weight of a person may vary from 90 lb to 300 lb
  · income of a person may vary from $10K to $1M
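A minimal sketch (the names are illustrative) of min-max scaling, which rescales each attribute to the range [0, 1] so that, for example, income in dollars does not dominate height in meters in the distance computation:

```python
def min_max_scale(records):
    """Rescale every attribute (column) of a list of numeric records to [0, 1]."""
    columns = list(zip(*records))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(value - lo) / (hi - lo) if hi > lo else 0.0
         for value, lo, hi in zip(record, lows, highs)]
        for record in records
    ]

# height (m), weight (lb), income ($): wildly different ranges before scaling.
people = [[1.5, 90, 10_000], [1.7, 180, 250_000], [1.8, 300, 1_000_000]]
print(min_max_scale(people))
```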
Summary: Nearest Neighbor Classifiers

• k-NN classifiers are lazy learners
  – Unlike eager learners such as decision tree induction and
    rule-based systems, they do not build models explicitly
  – Classifying unknown records is relatively expensive
• They rely on local knowledge ("let the data speak for themselves")
  and not on global models to make decisions.
• k-NN classifiers rely on a distance function; the quality of the
  distance function is critical for the performance of a k-NN classifier.
• They are capable of creating quite complex decision boundaries, which
  consist of edges of the Voronoi diagram.
• k-NN classifiers never actually compute decision boundaries,
  in contrast to decision trees/SVMs.
• k-NN classifiers obtain high accuracies and are quite popular in some
  fields, such as text data mining and information retrieval in general.