Methods in Medical Image Analysis
Statistics of Pattern Recognition: Classification and Clustering
Some content provided by Milos Hauskrecht, University of Pittsburgh Computer Science
ITK Questions?
Classification
Features
• Loosely stated, a feature is a value describing something about your data points (e.g. for pixels: intensity, local gradient, distance from a landmark, etc.)
• Multiple (n) features are put together to form a feature vector, which defines a data point’s location in n-dimensional feature space
Feature Space
• Feature Space – The theoretical n-dimensional space occupied
by n input raster objects (features).
– Each feature represents one dimension, and
its values represent positions along one of the
orthogonal coordinate axes in feature space.
– The set of feature values belonging to a data
point define a vector in feature space.
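As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of assembling per-pixel features into feature vectors; the specific feature values are made up:

import numpy as np

# Hypothetical per-pixel measurements (one value per pixel of a flattened image).
intensity = np.array([0.2, 0.8, 0.5, 0.9])
gradient = np.array([0.1, 0.4, 0.3, 0.7])
dist_to_landmark = np.array([5.0, 2.0, 3.5, 1.0])

# Each row is one pixel's feature vector, i.e. a point in 3-dimensional feature space.
feature_vectors = np.column_stack([intensity, gradient, dist_to_landmark])
print(feature_vectors.shape)  # (4, 3): 4 data points, 3 features each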
Statistical Notation
• Class probability distribution:
p(x,y) = p(x | y) p(y)
x: feature vector – {x1,x2,x3…,xn}
y: class
p(x | y): probability of x given y
p(x,y): probability of both x and y
Example: Binary Classification
• Two class-conditional distributions:
p(x | y = 0)
p(x | y = 1)
• Priors:
p(y = 0) + p(y = 1) = 1
Modeling Class Densities
• In the text, the authors concentrate on methods that use Gaussians to model the class densities
Generative Approach to Classification
1. Represent and learn the distribution:
p(x,y)
2. Use it to define probabilistic discriminant functions, e.g.
g0(x) = p(y = 0 | x)
g1(x) = p(y = 1 | x)
Generative Approach to Classification
Typical model:
p(x,y) = p(x | y) p(y)
p(x | y) = class-conditional distributions (densities)
p(y) = priors of classes (probability of class y)
We want:
p(y | x) = posteriors of classes
Class Modeling
• We model the class distributions as multivariate Gaussians
x ~ N(μ0, Σ0) for y = 0
x ~ N(μ1, Σ1) for y = 1
• Priors are based on training data, or a distribution can be chosen that is expected to fit the data well (e.g. a Bernoulli distribution for a coin flip)
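A minimal sketch (my own illustration, using NumPy and SciPy, with toy data and assumed variable names) of modeling the two class-conditional densities as multivariate Gaussians and estimating the priors from labeled training data:

import numpy as np
from scipy.stats import multivariate_normal

# Toy training data: X holds feature vectors (one per row), y holds labels in {0, 1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# One Gaussian per class: mean and covariance estimated from that class's samples.
mu0, Sigma0 = X[y == 0].mean(axis=0), np.cov(X[y == 0].T)
mu1, Sigma1 = X[y == 1].mean(axis=0), np.cov(X[y == 1].T)

# Priors taken from the class frequencies in the training data.
p_y0 = np.mean(y == 0)
p_y1 = 1.0 - p_y0

# Class-conditional densities p(x | y = 0) and p(x | y = 1).
p_x_given_y0 = multivariate_normal(mu0, Sigma0)
p_x_given_y1 = multivariate_normal(mu1, Sigma1)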
Making a class decision
• We need to define discriminant functions gi(x)
• We have two basic choices:
– Likelihood of the data – choose the class (Gaussian) that best explains the input data x, i.e. the class i that maximizes p(x | y = i)
– Posterior of the class – choose the class with the higher posterior probability, i.e. the class i that maximizes p(y = i | x)
Calculating Posteriors
• Use Bayes’ Rule:
P(A | B) = P(B | A) P(A) / P(B)
• In this case:
p(y | x) = p(x | y) p(y) / p(x)
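Continuing the sketch above (same assumed variables), the posterior can be computed with Bayes' rule, with p(x) obtained from the law of total probability, and used directly as a discriminant function:

def posterior_y1(x):
    # p(y = 1 | x) = p(x | y = 1) p(y = 1) / p(x),
    # where p(x) = p(x | y = 0) p(y = 0) + p(x | y = 1) p(y = 1).
    num = p_x_given_y1.pdf(x) * p_y1
    den = num + p_x_given_y0.pdf(x) * p_y0
    return num / den

x_new = np.array([1.5, 1.5])
print(posterior_y1(x_new))                    # posterior probability of class 1
print(1 if posterior_y1(x_new) > 0.5 else 0)  # class decision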
Linear Decision Boundary
• When covariances are the same
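For reference, a brief worked derivation of this claim (a standard result, not reproduced from the slides): the decision boundary is the set of points where the two posteriors are equal, and taking the log of their ratio for the Gaussian class models gives

\log\frac{p(y=1\mid x)}{p(y=0\mid x)}
  = -\tfrac{1}{2}(x-\mu_1)^{\top}\Sigma_1^{-1}(x-\mu_1)
    + \tfrac{1}{2}(x-\mu_0)^{\top}\Sigma_0^{-1}(x-\mu_0)
    - \tfrac{1}{2}\log\frac{|\Sigma_1|}{|\Sigma_0|}
    + \log\frac{p(y=1)}{p(y=0)} = 0

In general this expression is quadratic in x. When Σ0 = Σ1 = Σ, the two quadratic terms cancel, leaving a linear boundary wᵀx + b = 0 with w = Σ⁻¹(μ1 − μ0) and b collecting the remaining constants; when the covariances differ, the quadratic terms remain, which gives the quadratic decision boundary discussed next.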
Quadratic Decision Boundary
• When covariances are different
Clustering
• Basic Clustering Problem:
– Distribute data into k different groups such that data points similar to each other are in the same group
– Similarity between points is defined in terms of some distance metric
• Clustering is useful for:
– Similarity/Dissimilarity analysis
• Analyze which data points in the sample are close to each other
– Dimensionality Reduction
• High-dimensional data are replaced with a group (cluster) label
Distance Metrics
• Euclidean distance, in some space (for our purposes, probably a feature space)
• Must fulfill three properties:
– d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y
– Symmetry: d(x, y) = d(y, x)
– Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)
Distance Metrics
• Common simple metrics (sketched below):
– Euclidean: d(x, y) = sqrt(Σi (xi − yi)²)
– Manhattan: d(x, y) = Σi |xi − yi|
• Both work for an arbitrary k-dimensional space
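A minimal NumPy sketch of the two metrics for arbitrary k-dimensional points (illustration only):

import numpy as np

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return np.sum(np.abs(a - b))

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(a, b))  # ~3.606
print(manhattan(a, b))  # 5.0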
Clustering Algorithms
• k-Nearest Neighbor
• k-Means
• Parzen Windows
k-Nearest Neighbor
• In essence, a classifier
• Requires input parameter k
– In this algorithm, k indicates the number of neighboring points to take into account when classifying a data point
• Requires training data
k-Nearest Neighbor Algorithm
• For each data point xn, choose its class by finding the most common class among the k nearest data points in the training set
• Use any distance measure (usually a Euclidean distance measure), as sketched below
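A minimal sketch of this rule in NumPy (an illustration only; the variable names train_X and train_y and the toy data are assumptions):

import numpy as np

def knn_classify(x, train_X, train_y, k=5):
    # Euclidean distance from x to every training point.
    dists = np.sqrt(np.sum((train_X - x) ** 2, axis=1))
    # Indices of the k nearest training points.
    nearest = np.argsort(dists)[:k]
    # Majority vote among their class labels.
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

train_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([0.5, 0.5]), train_X, train_y, k=3))  # 0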
k-Nearest Neighbor Algorithm
[Figure: training examples labeled + and − around a query point q1. 1-nearest neighbor: q1 is assigned the concept represented by its nearest example e1. 5-nearest neighbors: q1 is classified as negative.]
k-Nearest Neighbor
• Advantages:
– Simple
– General (can work for any distance measure you want)
• Disadvantages:
– Requires well-classified training data
– Can be sensitive to the k value chosen
– All attributes are used in classification, even ones that may be irrelevant
– Inductive bias: we assume that a data point should be classified the same as points near it
k-Means
• Suitable only when data points have continuous values
• Groups are defined in terms of cluster centers (means)
• Requires input parameter k
– In this algorithm, k indicates the number of clusters to be created
• Guaranteed to converge to at least a local optimum
k-Means Algorithm
• Algorithm (sketched below):
1. Randomly initialize k mean values
2. Repeat the following two steps until the means no longer change:
a. Partition the data according to the current means, using a similarity (distance) measure
b. Move each mean to the center of the data in its current partition
3. Stop when there is no change in the means
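A minimal NumPy sketch of these steps, assuming Euclidean distance (an illustration only, not the only possible implementation):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly initialize the k means by picking k distinct data points.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2a. Partition: assign each point to its nearest mean (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # 2b. Update: move each mean to the center of its partition
        #     (keep the old mean if a cluster happens to be empty).
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        # 3. Stop when the means no longer change.
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels

Calling, say, kmeans(feature_vectors, k=2) on the feature vectors above would return the cluster centers and each point's cluster label.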
k-Means
• Advantages:
– Simple
– General (can work for any distance measure you want)
– Requires no training phase
• Disadvantages:
– Result is very sensitive to initial mean placement
– Can perform poorly on overlapping regions
– Doesn’t work on features with non-continuous values (can’t compute cluster means)
– Inductive bias: we assume that a data point should be classified the same as points near it
Parzen Windows
• Similar to k-Nearest Neighbor, but instead of using the k closest training data points, it uses all points within a kernel (window), weighting their contribution to the classification based on the kernel
• As with our classification algorithms, we will consider a Gaussian kernel as the window
Parzen Windows
• Assume a region defined by a d-dimensional Gaussian of scale σ
• We can define a window density function:
p(x, σ) = (1/|S|) Σ_{j=1..|S|} G(x − S(j), σ²)
where S is the training set and G is the Gaussian kernel
• Note that we consider all points in the training set, but if a point lies far outside the kernel, its weight is effectively 0, negating its influence
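A minimal NumPy sketch of this window density estimate with an isotropic d-dimensional Gaussian kernel (illustration only; S and sigma are as defined above, and the toy training set is made up):

import numpy as np

def parzen_density(x, S, sigma):
    # p(x, sigma) = (1/|S|) * sum_j G(x - S[j], sigma^2),
    # with an isotropic d-dimensional Gaussian kernel of scale sigma.
    d = S.shape[1]
    sq_dists = np.sum((S - x) ** 2, axis=1)            # squared distance from x to each S[j]
    norm = (2.0 * np.pi * sigma ** 2) ** (-d / 2.0)    # Gaussian normalization constant
    return np.mean(norm * np.exp(-sq_dists / (2.0 * sigma ** 2)))

S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])  # toy training set
print(parzen_density(np.array([0.5, 0.5]), S, sigma=1.0))

Per-class densities estimated this way can then be compared (weighted by the priors) exactly as in the generative approach above.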
Parzen Windows
• Advantages:
– More robust than k-nearest neighbor
– Excellent accuracy and consistency
• Disadvantages:
– How to choose the size of the window?
– Alone, kernel density estimation techniques provide little insight into data or problems