Basics of discriminant analysis
• Purpose
• Various situations
• Examples
• R commands
• Demonstration
Purpose
Assume that we have several groups and we know that our observations must belong to one of these groups. For example, we may know that there are several diseases and from the symptoms we want to decide which disease we are dealing with. Or we may have several species of plants: when we observe various characteristics of a specimen we want to know to which species it belongs.
We want to divide our space into regions; when we make an observation we decide which region it falls in. Each region is assigned to one of the classes. If an observation falls in region number k then we say that this observation belongs to class number k.
In the picture we have 3 regions. If an observation falls in region 1 then we decide that it is a member of class 1.
Discriminant analysis is widely used in many fields. For
example it is an integral part of neural networks.
[Figure: 2D example, a plane divided into three regions labelled 1, 2 and 3]
Various situations
There can be several situations:
1) We know the distribution for each class (an unrealistic assumption). Then the problem becomes easy: for a given observation we calculate its probability under each class using the known formula, and the class with the largest value wins. A sketch of this rule is given after this list.
2) We know the form of the distributions but we do not know their parameters. For example, we may know that the distribution for each class is normal but we do not know the means and variances of these distributions. Then we need representatives for each class. Once we have representatives we can estimate the parameters of the distributions (mean and variance in the normal case). When we have a new observation we use these estimates as if they were the true parameters and calculate the probabilities. Again the largest probability wins.
3) We have prior probabilities. E.g. in the case of diseases we may know that one of them has prior probability 0.7 and the other has prior probability 0.3. In this case we multiply the probability of the observation under each class by the corresponding prior before comparing.
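A minimal R sketch of situations 1 and 3, assuming two classes with known normal densities; the means, standard deviations and priors below are illustrative values only:

# Known class densities: normal, with illustrative parameters
means  <- c(5, 8)      # class means (assumed values)
sds    <- c(1, 1)      # class standard deviations (assumed values)
priors <- c(0.7, 0.3)  # prior probabilities (situation 3); use c(1, 1) to ignore priors

# Classify an observation: evaluate prior * density for each class, take the largest
classify <- function(x) {
  scores <- priors * dnorm(x, mean = means, sd = sds)
  which.max(scores)
}

sapply(c(4.2, 6.5, 7.9), classify)   # class label for each new observation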
Various situations: Unknown parameters
If we know that the probability distributions are normal then we have two cases:
1) The variances of these distributions are the same.
In this case the space is divided by hyperplanes. In the one dimensional case with two classes we have one point that divides the line into two regions; this point lies in the middle between the means of the two distributions. In the two dimensional case with two classes we have a line that divides the plane into two regions; this line intersects the segment joining the two means at its midpoint. In three dimensional space we will have planes.
2) The variances are different.
In this case the regions are bounded by curves defined by quadratic forms. In the one dimensional case we will have two points. In the two dimensional case we may have an ellipse, a hyperbola, a parabola or two lines; the form of these curves depends on the differences between the variances. In the three dimensional case we can have ellipsoids, hyperboloids, etc.
Maximum likelihood discriminant analysis
Let us assume that we have g populations (groups). Each population has a probability distribution Li(x). For an observation the likelihood under each population is calculated and the population with the largest likelihood is chosen. If two of the populations have the same likelihood then either one can be chosen. Let us assume we are dealing with one dimensional populations and their distributions are normal. Moreover let us assume that we have only two populations; then we have
$$L_1(x) \ge L_2(x) \;\Longleftrightarrow\; \frac{(x-\mu_1)^2}{2\sigma_1^2} + \log(\sigma_1) \le \frac{(x-\mu_2)^2}{2\sigma_2^2} + \log(\sigma_2) \;\Longleftrightarrow$$
$$x^2\left(\frac{1}{\sigma_2^2}-\frac{1}{\sigma_1^2}\right) - 2x\left(\frac{\mu_2}{\sigma_2^2}-\frac{\mu_1}{\sigma_1^2}\right) + \left(\frac{\mu_2^2}{\sigma_2^2}-\frac{\mu_1^2}{\sigma_1^2}\right) \ge 2\log\!\left(\frac{\sigma_1}{\sigma_2}\right)$$
This quadratic inequality divides the real line into two regions, one for each class. When the inequality is satisfied the observation belongs to class 1, otherwise it belongs to class 2. When the variances are equal the quadratic terms cancel and we are left with a linear inequality; for example if μ1 > μ2 then the rule puts x into group 1 whenever x > (μ1+μ2)/2.
Multidimensional cases are similar to the one dimensional case except that the inequalities are multidimensional. When the variances (covariance matrices) are equal the space is divided by a hyperplane (a line in the two dimensional case).
If the parameters of the distributions are not known they are estimated from the given observations.
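A rough R sketch of this rule for two one dimensional normal populations; the parameter values are illustrative and the priors are dropped so that only the densities are compared:

# Maximum likelihood rule for two 1D normal populations
# (mu1, mu2, s1, s2 are illustrative assumed values)
mu1 <- 5; s1 <- 1
mu2 <- 8; s2 <- 2

ml_class <- function(x) {
  # compare log-likelihoods; equivalent to the quadratic inequality above
  ll1 <- -(x - mu1)^2 / (2 * s1^2) - log(s1)
  ll2 <- -(x - mu2)^2 / (2 * s2^2) - log(s2)
  ifelse(ll1 >= ll2, 1, 2)
}

ml_class(c(4, 6.5, 9, 15))   # classify a few new observations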
Distributions with equal and known variances: 1D example
The probability distributions for the classes are known and normal. The variances of both are 1. One of them has mean 5 and the other has mean 8. Anything below 6.5 belongs to class 1 and anything above 6.5 belongs to class 2; an observation with value exactly 6.5 can be assigned to either class.
The observations a and b will be assigned to class 1 and the observations c and d will be assigned to class 2. Anything smaller than the midpoint of the two means is assigned to class 1 and anything bigger than this value belongs to class 2.
[Figure: normal densities of class 1 and class 2, new observations a, b, c, d, and the discrimination point]
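This boundary can be checked directly in R with the means 5 and 8 and variance 1 stated above; the values used for a, b, c, d below are illustrative guesses, not read off the figure:

mu1 <- 5; mu2 <- 8; s <- 1
boundary <- (mu1 + mu2) / 2                        # 6.5, the discrimination point
dnorm(boundary, mu1, s); dnorm(boundary, mu2, s)   # equal densities at the boundary
x <- c(4.0, 6.0, 7.0, 9.5)                         # hypothetical new observations a, b, c, d
ifelse(dnorm(x, mu1, s) >= dnorm(x, mu2, s), 1, 2) # a, b -> class 1; c, d -> class 2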
Distributions with known but different variances: 1D example
Assume that we have two classes. The probability distributions of both are normal, with known means and variances. One of the distributions is much sharper than the other. In this case the probability of observation b under class 2 is higher than under class 1. The probability of c under class 1 is higher than under class 2. The probability of observation a under class 1, although very small, is still higher than under class 2. Thus the observations a, c, d will be assigned to class 1 and the observation b to class 2. Very small and very large observations will belong to class 1 and medium observations to class 2.
[Figure: two normal densities with different variances, the intervals assigned to class 1 and class 2, and new observations a, b, c, d]
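A short R sketch of how the two boundary points could be found in this unequal variance case; the means and variances below are illustrative, not read off the figure:

# Boundary points where the two class densities are equal (unequal variances)
# Illustrative assumed parameters: class 1 broad, class 2 sharp
mu1 <- 5; s1 <- 3
mu2 <- 6; s2 <- 1

# Coefficients of the quadratic inequality derived earlier:
# x^2 (1/s2^2 - 1/s1^2) - 2x (mu2/s2^2 - mu1/s1^2) + (mu2^2/s2^2 - mu1^2/s1^2) = 2 log(s1/s2)
a  <- 1/s2^2 - 1/s1^2
b  <- -2 * (mu2/s2^2 - mu1/s1^2)
c0 <- (mu2^2/s2^2 - mu1^2/s1^2) - 2 * log(s1/s2)
roots <- sort(Re(polyroot(c(c0, b, a))))
roots   # class 2 (the sharp one) wins between these two points, class 1 outside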
Two dimensional example
In the two dimensional case we want to divide the whole plane into two (or more) regions. When a new observation falls in one of these regions we decide its class number. The red dot is in the region corresponding to class 1 and the blue dot is in the region belonging to class 2.
The parameters of the distributions are calculated using the sample points (shown by small black dots). There are 50 observations for each class. If the variances of the distributions are taken to be equal we get a linear discrimination. If the variances were unequal we would have a quadratic discrimination (the boundary would be a quadratic curve).
[Figure: samples from class 1 and class 2, the discrimination line, and two new observations (red and blue dots)]
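A sketch reproducing this kind of 2D example in R, assuming two bivariate normal classes with illustrative means and a shared covariance (the actual data behind the figure are not given):

library(MASS)                      # mvrnorm, lda
set.seed(1)
Sigma <- diag(2)                   # common covariance, so the boundary is linear
x1 <- mvrnorm(50, mu = c(0, 0), Sigma = Sigma)   # 50 points for class 1
x2 <- mvrnorm(50, mu = c(3, 3), Sigma = Sigma)   # 50 points for class 2
data      <- rbind(x1, x2)
groupings <- factor(rep(1:2, each = 50))

z <- lda(data, grouping = groupings)             # estimate the discrimination line
newobs <- rbind(c(0.5, 0.5), c(3.5, 2.5))        # two new observations to classify
predict(z, newobs)$class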
Likelihood ratio discriminant analysis
The likelihood ratio discriminant rule adds the given observation in turn to each group being tested, the parameters are re-estimated, and the likelihood is computed. This is done for each group and the observation is allocated to the group with the largest resulting likelihood. This technique tends to put an observation into the population that has the larger sample size.
Fisher’s discriminant function
Fisher's discrimination rule maximises the ratio of the between-group sum of squares to the within-group sum of squares:
$$\frac{a^T B a}{a^T W a} \to \max$$
where W is the within-group sum of squares:
$$W = \frac{1}{n-g}\sum_{i=1}^{g}\sum_{j=1}^{n_i} (x_{ij}-\bar{x}_i)(x_{ij}-\bar{x}_i)^T$$
Here n is the total number of observations, g is the number of groups and n_i is the number of observations in group i. There are several ways of calculating the between-group sum of squares. One popular way is a weighted sum of squares:
$$B = \frac{1}{n-g}\sum_{j=1}^{g} n_j (\bar{x}_j-\bar{x})(\bar{x}_j-\bar{x})^T$$
Then the problem of finding the discrimination rule reduces to finding the maximum eigenvalue and the corresponding eigenvector a of the matrix $W^{-1}B$. A new observation x is put into group i if the following inequality holds:
$$|a^T x - a^T \bar{x}_i| \le |a^T x - a^T \bar{x}_j| \quad \text{for all } i \ne j$$
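A rough R sketch of this computation, assuming a data matrix data and a factor groupings as in the R commands section below; in practice MASS::lda performs this calculation for you:

# Fisher's discriminant direction from W^{-1} B (sketch)
fisher_direction <- function(data, groupings) {
  data <- as.matrix(data)
  n <- nrow(data); g <- nlevels(groupings)
  xbar <- colMeans(data)
  W <- matrix(0, ncol(data), ncol(data))
  B <- matrix(0, ncol(data), ncol(data))
  for (lev in levels(groupings)) {
    xi <- data[groupings == lev, , drop = FALSE]
    mi <- colMeans(xi)
    W  <- W + crossprod(sweep(xi, 2, mi))        # sum (x_ij - xbar_i)(x_ij - xbar_i)^T
    B  <- B + nrow(xi) * tcrossprod(mi - xbar)   # n_j (xbar_j - xbar)(xbar_j - xbar)^T
  }
  W <- W / (n - g); B <- B / (n - g)
  Re(eigen(solve(W) %*% B)$vectors[, 1])         # eigenvector of the largest eigenvalue
}

# Example use with the built-in iris data:
# fisher_direction(iris[, 1:4], iris$Species)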
When parameters of distributions are unknown
In general the problem consists of two parts:
1) Classification. At this stage the space is divided into regions and each region belongs to one class. In some sense it means that we need to find a function or inequalities that divide the space into parts. This is usually done using the probability distribution of each class. In a way this stage can be considered as rule generation.
2) Discrimination. Once the space has been partitioned, or the rules have been generated, these rules are used to assign new observations to classes.
Note that if each observation is assigned to one class only then the rule is deterministic. There are other kinds of rules as well, for example fuzzy rules, where each observation has a degree of membership in each class. For example an observation may belong to class 1 with degree 0.7 and to class 2 with degree 0.3 (see the sketch below).
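In R a comparable graded output can be obtained from the posterior class probabilities returned by predict in the MASS package. A sketch, using the built-in iris data purely as a stand-in for your own data and groupings:

library(MASS)
z <- lda(iris[, 1:4], grouping = iris$Species)
p <- predict(z, iris[1:3, 1:4])
p$class        # hard (deterministic) class assignments
p$posterior    # graded degrees of membership, each row sums to 1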
Probability of misclassification
Let us assume we have g groups (classes). The probability of misclassification is defined as the probability of putting an observation into class i when it is from class j. It is denoted pij. In particular the probability of correct allocation for class i is pii and the probability of misclassification for this class is 1-pii.
Assume that we have two discriminant rules, d and d'. The rule d is said to be as good as d' if:
pii >= p'ii for i = 1, ..., g
d is better than d' if the inequality is strict in at least one case. If there is no better rule than d then d is called an admissible rule.
In general it may not be possible to compare two rules; for example it may happen that p11 > p'11 but p22 < p'22.
Resampling and misclassification
Resubstitution: Estimate the discriminant rule and then, for the same observations, calculate the probability of misclassification. The problem with this technique is that it gives, as expected, an optimistic estimate.
Jackknife: From each class one observation is removed in turn, the discriminant rule is estimated and the removed observation is predicted. The probability of misclassification is then calculated as ni1/n1, where n1 is the number of observations in the first group and ni1 is the number of cases in which an observation from group 1 was classified as belonging to group i. Similar misclassification probabilities are calculated for each class. A sketch is given below.
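In R a leave-one-out (jackknife) estimate of these probabilities can be obtained with the CV=TRUE option of lda. A sketch, again using the built-in iris data as a stand-in for the training set:

library(MASS)
z_cv <- lda(iris[, 1:4], grouping = iris$Species, CV = TRUE)  # leave-one-out predictions
tab  <- table(true = iris$Species, predicted = z_cv$class)    # counts n_ij
prop.table(tab, margin = 1)    # each row gives the estimated p_ij for that true class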
Bootstrap: Resample the sample of observations. There are several techniques that apply the bootstrap; one of them is described here.
First calculate the misclassification probabilities using resubstitution and denote them eai. Then resample, either all observations simultaneously or each group separately (i.e. take a sample of n1 points from group 1, etc.). Define the discrimination rule on the bootstrap sample and estimate the probabilities of misclassification both for the bootstrap sample and for the original sample; denote them epib and pib. Calculate the differences dib = epib - pib. Repeat this B times and average; the average <d> is the bootstrap bias correction. Then the probability of misclassification is estimated as ea - <d>. A sketch is given below.
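A rough R sketch of this bootstrap bias correction, once more using the built-in iris data as a stand-in; to keep the sketch short the overall correct-classification rate is corrected rather than each pii separately:

library(MASS)
set.seed(2)
X <- iris[, 1:4]; grp <- iris$Species

# apparent (resubstitution) correct-classification rate
acc <- function(fit, X, grp) mean(predict(fit, X)$class == grp)
ea  <- acc(lda(X, grouping = grp), X, grp)

B <- 200
d <- replicate(B, {
  idx  <- sample(nrow(X), replace = TRUE)              # resample all observations
  fitb <- lda(X[idx, ], grouping = grp[idx])
  acc(fitb, X[idx, ], grp[idx]) - acc(fitb, X, grp)    # ep_b - p_b
})
ea - mean(d)   # bias-corrected estimate of the correct-classification rate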
R commands for discriminant analysis
Commands for discriminant analysis are in the library MASS. This library should be loaded first:
library(MASS)
The necessary commands are:
lda – linear discriminant analysis. Using the given observations this command calculates the discrimination lines (hyperplanes).
qda – quadratic discriminant analysis. This command calculates the necessary quadratic equations; it does not assume equality of the variances.
predict – for a new observation it decides to which class it belongs.
Example of use:
z = lda(data,grouping=groupings)
predict(z,newobservations)
Similarly for quadratic discriminant analysis:
z = qda(data,grouping=groupings)
predict(z,newobservations)$class
Here data is the data matrix given to us for calculating the discrimination rule; it can be considered as the training data set. groupings defines which observation belongs to which class.
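A complete worked example of these commands, using the built-in iris data in place of your own data and groupings (the choice of iris is purely illustrative):

library(MASS)

data      <- iris[, 1:4]       # measurements (training data)
groupings <- iris$Species      # class of each observation

# linear discriminant analysis (assumes equal covariance matrices)
z <- lda(data, grouping = groupings)
newobservations <- iris[c(1, 51, 101), 1:4]    # three observations to classify
predict(z, newobservations)$class

# quadratic discriminant analysis (covariance matrices may differ)
z2 <- qda(data, grouping = groupings)
predict(z2, newobservations)$class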