MathReview - Computer Engineering
Pattern Recognition
Mathematics Review
Hamid R. Rabiee
Jafar Muhammadi
Ali Jalali
Probability Space
A triple (Ω, F, P):
Ω represents a nonempty set, whose elements are sometimes known as outcomes or states of nature.
F represents a set whose elements are called events. The events are subsets of Ω, and F should be a σ-field ("Borel field").
P represents the probability measure.
Fact: P(Ω) = 1
Sharif University of Technology, Department of Computer Engineering, Pattern Recognition Course
http://ce.sharif.edu/courses/85-86/1/ce725/
Random Variables
A random variable is a "function" ("mapping") from a set of possible
outcomes of the experiment to an interval of real (complex) numbers.
In other words:
  X : F → I, where F ⊆ Ω (the outcomes), I ⊆ ℝ (the real line), and X(b) = r
Examples:
Mapping the faces of a die to the first six natural numbers.
Mapping the height of a man to the real interval (0, 3] (meters, or another unit).
Mapping success in an exam to the discrete interval [0, 20] with step 0.1.
Random Variables (Cont’d)
Random Variables
Discrete
Dice, Coin, Grade of a course, etc.
Continuous
Temperature, Humidity, Length, etc.
Random Variables
Real
Complex
Density/Distribution Functions
Probability Mass Function (PMF)
Discrete random variables
Summation of impulses
The magnitude of each impulse represents the probability of occurrence of the
outcome
PMF values are probabilities.
Example I:
Rolling a fair die
  P_X(X = i) = 1/6 for i = 1, 2, …, 6
  [Figure: PMF of a fair die, six impulses of height 1/6 at X = 1, …, 6]
Density/Distribution Functions (Cont’d)
Example II:
Summation of two fair dice
  [Figure: PMF of the sum of two fair dice, impulses at X = 2, …, 12 rising to a peak of 1/6 at X = 7]
Note: the summation of all probabilities must be equal to ONE. (Why?)
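The two-dice PMF above can be built by brute-force enumeration of the 36 equally likely outcomes. A minimal Python sketch (my variable names, not the slides'); exact fractions make the sum-to-one check exact:

```python
from itertools import product
from fractions import Fraction

# PMF of the sum of two fair dice: enumerate all 36 equally likely outcomes.
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    s = a + b
    pmf[s] = pmf.get(s, Fraction(0)) + Fraction(1, 36)

print(pmf[7])             # most likely sum: 6/36 = 1/6
print(sum(pmf.values()))  # probabilities sum to ONE: 1
```

The sum equals one because the 36 outcomes partition the sample space, and each impulse collects a disjoint subset of them.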
Density/Distribution Functions (Cont’d)
Probability Density Function (PDF)
Continuous random variables
The probability of occurrence of x₀ ∈ (x − dx/2, x + dx/2) will be P(x)·dx
  [Figure: a continuous density curve P_X; the strip of width dx centered at x has area P(x)·dx]
Famous Density Functions
Some famous masses and densities
Uniform Density
  f(x) = (1/a)·(U(x − begin) − U(x − end))
  [Figure: constant density of height 1/a over an interval of length a]
Gaussian (Normal) Density
  f(x) = (1/(σ√(2π)))·e^(−(x − μ)²/(2σ²)) = N(μ, σ²)
Exponential Density
  f(x) = λ·e^(−λx)·U(x) = { λ·e^(−λx),  x ≥ 0
                            0,          x < 0 }
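A quick sanity check on the three densities above is that each integrates to one. A rough Python sketch using a midpoint Riemann sum; the parameter values (a = 2, μ = 1, σ = 0.5, λ = 1.5) are arbitrary choices of mine:

```python
import math

def integrate(f, lo, hi, n=20000):
    # midpoint Riemann sum over [lo, hi]
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

a = 2.0  # uniform density on [0, a]
uniform = lambda x: 1.0 / a if 0 <= x <= a else 0.0

mu, sigma = 1.0, 0.5  # Gaussian parameters
gauss = lambda x: math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

lam = 1.5  # exponential rate; density is 0 for x < 0
expo = lambda x: lam * math.exp(-lam * x) if x >= 0 else 0.0

for f, (lo, hi) in [(uniform, (0, a)),
                    (gauss, (mu - 8 * sigma, mu + 8 * sigma)),
                    (expo, (0, 30))]:
    print(round(integrate(f, lo, hi), 4))  # each ≈ 1.0
```

For the Gaussian and exponential the integration range is truncated where the tail mass is negligible.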
Distribution Measures
The most basic descriptors of a distribution include:
Mean
Variance & Standard Deviation
Covariance
Correlation Coefficient
Expected Value
Expected value (population mean):
  E[X] = Σₓ x·f(x)  if X is discrete
  E[X] = ∫₋∞^∞ x·f(x) dx  if X is continuous
Properties of Expected Value
The expected value of a constant is the constant itself: E[b] = b
If a and b are constants, then E[aX + b] = a·E[X] + b
If X and Y are independent RVs, then E[XY] = E[X]·E[Y]
If X is a RV with PDF f(x) and g(X) is any function of X, then
  E[g(X)] = Σₓ g(x)·f(x)  if X is discrete
  E[g(X)] = ∫₋∞^∞ g(x)·f(x) dx  if X is continuous
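The discrete case is easy to check by direct summation. A minimal sketch for a fair die (the helper `E` and the constants a = 2, b = 3 are my choices), verifying the linearity property E[aX + b] = a·E[X] + b:

```python
from fractions import Fraction

# PMF of one fair die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def E(g, pmf):
    # E[g(X)] = sum of g(x) * f(x) over the outcomes
    return sum(g(x) * p for x, p in pmf.items())

EX = E(lambda x: x, pmf)
print(EX)                           # 7/2
# Linearity: E[2X + 3] = 2 * E[X] + 3 = 10
print(E(lambda x: 2 * x + 3, pmf))  # 10
```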
Variance & Standard Deviation
Let X be a RV with E(X) = μ. The distribution, or spread, of the X values around the
expected value can be measured by the variance (σ_X is the standard deviation of X):
  var(X) = σ_X² = E[(X − μ)²] = Σₓ (x − μ)²·f(x) = E(X²) − μ² = E(X²) − [E(X)]²
The Variance Properties
The variance of a constant is zero.
If a and b are constants, then var(aX + b ) = a 2 var(X )
If X and Y are independent RVs, then var(X + Y) = var(X − Y) = var(X) + var(Y)
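The scaling property var(aX + b) = a²·var(X) can also be checked on a fair die. A small sketch reusing the discrete expectation (the helpers and the constants a = 2, b = 3 are mine):

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die

def E(g):
    return sum(g(x) * p for x, p in pmf.items())

def var(g):
    # var = E[g(X)^2] - (E[g(X)])^2
    return E(lambda x: g(x) ** 2) - E(g) ** 2

vX = var(lambda x: x)
print(vX)                        # 35/12 for a fair die
print(var(lambda x: 2 * x + 3))  # var(2X + 3) = 4 * var(X) = 35/3
```

Note that the additive constant b drops out entirely, matching "the variance of a constant is zero".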
Covariance
Covariance of two RVs, X and Y: Measurement of the nature of the association
between the two.
Cov (X ,Y ) = E [(X - mx )(Y - my )] = E (XY ) - mx my
Properties of Covariance:
If X and Y are two independent RVs, then Cov(X, Y) = E(XY) − μₓ·μ_y = E(X)·E(Y) − μₓ·μ_y = 0
If a, b, c and d are constants, then Cov(a + bX, c + dY) = bd·Cov(X, Y)
If X is a RV, then Cov(X, X) = E(X²) − μₓ² = Var(X)
Covariance value
  A large positive Cov(X, Y) indicates a strong positive relationship between the two.
  A large negative Cov(X, Y) indicates a strong negative relationship between the two.
  Cov(X, Y) = 0 indicates no linear relationship between the two (it does not by itself imply independence).
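The last point deserves care: zero covariance rules out a linear relationship, not every relationship. A small counterexample (my construction, not from the slides): take X uniform on {−1, 0, 1} and Y = X², which is completely determined by X yet uncorrelated with it:

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}; Y = X^2 is clearly dependent on X.
pmf = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

EX  = sum(x * p for x, p in pmf.items())         # 0
EY  = sum(x ** 2 * p for x, p in pmf.items())    # E[X^2] = 2/3
EXY = sum(x * x ** 2 * p for x, p in pmf.items())  # E[X^3] = 0
cov = EXY - EX * EY
print(cov)  # 0: no LINEAR relationship, yet X and Y are not independent
```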
Covariance Matrices
If x is a d-dimensional RV, then the covariance matrix is defined as
  Σ = E[(x − μₓ)(x − μₓ)ᵗ]
whose ij-th element σᵢⱼ is the covariance of xᵢ and xⱼ:
  σᵢⱼ = Cov(xᵢ, xⱼ) = E[(xᵢ − μᵢ)(xⱼ − μⱼ)],  i, j = 1, …, d
so that
  Σ = | σ₁₁ σ₁₂ … σ₁d |
      | σ₂₁ σ₂₂ … σ₂d |
      |  ⋮   ⋮  ⋱  ⋮  |
      | σd₁ σd₂ … σdd |
Covariance Matrices (cont’d)
Properties of Covariance Matrix:
If the variables are statistically independent, the covariances are zero, and the covariance
matrix is diagonal.
  Σ = | σ₁²  0  …  0  |        Σ⁻¹ = | 1/σ₁²   0    …   0    |
      |  0  σ₂² …  0  |              |   0   1/σ₂²  …   0    |
      |  ⋮   ⋮  ⋱  ⋮  |              |   ⋮     ⋮    ⋱   ⋮    |
      |  0   0  … σd² |              |   0     0    … 1/σd²  |
and the quadratic form in the exponent is (x − μ)ᵗ Σ⁻¹ (x − μ).
Noting that the determinant of Σ is then just the product of the variances, we can write
  p(x) = (1/((2π)^(d/2)·|Σ|^(1/2)))·e^(−(1/2)·(x − μ)ᵗ Σ⁻¹ (x − μ))
This is the general form of a multivariate normal density function, where the covariance
matrix is no longer required to be diagonal.
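The general density formula can be coded directly for the 2-D case, inverting Σ by hand. A sketch (the covariance values are illustrative choices of mine); a crude grid integration confirms the density has total mass one:

```python
import math

def mvn_pdf(x, mu, cov):
    # 2-D multivariate normal: p(x) = exp(-0.5 (x-mu)^T Sigma^-1 (x-mu)) / (2*pi*sqrt|Sigma|)
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # 2x2 inverse
    dx = [x[0] - mu[0], x[1] - mu[1]]
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

mu  = [0.0, 0.0]
cov = [[1.0, 0.5], [0.5, 2.0]]  # not diagonal: the two components are correlated

# crude midpoint-grid integration over [-8, 8]^2 should give ~1
n, lo, hi = 200, -8.0, 8.0
h = (hi - lo) / n
total = sum(mvn_pdf([lo + (i + .5) * h, lo + (j + .5) * h], mu, cov) * h * h
            for i in range(n) for j in range(n))
print(round(total, 3))  # ≈ 1.0
```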
Correlation Coefficient
Correlation: Knowing about a random variable “X”, how much
information will we gain about the other random variable “Y” ?
The population correlation coefficient is defined as
  ρ = Cov(X, Y) / (σₓ·σ_y)
The Correlation Coefficient is a measure of linear association between two
variables and lies between -1 and +1
-1 indicating perfect negative association
+1 indicating perfect positive association
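The ±1 extremes are reached exactly when Y is a linear function of X. A short sketch (my helper, applied to the sample version of the formula with n − 1 denominators; the data are illustrative):

```python
import math

def corr(xs, ys):
    # sample correlation: cov(X, Y) / (sx * sy)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
print(round(corr(xs, [2 * x + 1 for x in xs]), 6))  # perfect positive: 1.0
print(round(corr(xs, [-3 * x for x in xs]), 6))     # perfect negative: -1.0
```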
Moments
The nth-order moment of a RV X is the expected value of Xⁿ:
  Mₙ = E(Xⁿ)
Normalized form (central moment):
  Mₙ = E[(X − μₓ)ⁿ]
The mean is the first moment.
The variance is the second central moment; equivalently, the second moment minus the square of the mean.
Sharif University of Technology, Department of Computer Engineering, Pattern Recognition Course
16
http://ce.sharif.edu/courses/85-86/1/ce725/
Moments (cont’d)
Third moment
  Measure of asymmetry, often referred to as skewness:
    s = ∫ (x − μ)³·f(x) dx
  If the density is symmetric, s = 0.
Fourth moment
  Measure of flatness, often referred to as kurtosis:
    k = ∫ (x − μ)⁴·f(x) dx
  [Figure: two densities illustrating small vs. large kurtosis]
Sample Measures
Sample: a random selection of items from a lot or population in order to evaluate
the characteristics of the lot or population
Sample mean:        X̄ = (1/n)·Σᵢ₌₁ⁿ Xᵢ
Sample variance:    S² = (1/(n − 1))·Σᵢ₌₁ⁿ (Xᵢ − X̄)²
Sample covariance:  Cov(X, Y) = (1/(n − 1))·Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ)
Sample correlation: Corr = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / ((n − 1)·Sₓ·S_y)
3rd central moment: (1/(n − 1))·Σᵢ₌₁ⁿ (Xᵢ − X̄)³
4th central moment: (1/(n − 1))·Σᵢ₌₁ⁿ (Xᵢ − X̄)⁴
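These sample measures are direct translations of the formulas. A minimal sketch (function and data are mine; note the n − 1 denominators throughout):

```python
def sample_stats(xs):
    # sample mean, variance, and 3rd/4th central moments, all with n-1 denominators
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    m3 = sum((x - mean) ** 3 for x in xs) / (n - 1)
    m4 = sum((x - mean) ** 4 for x in xs) / (n - 1)
    return mean, var, m3, m4

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, var, m3, m4 = sample_stats(xs)
print(mean)  # 5.0
print(var)   # 32/7, matching statistics.variance
```

The same `m3` and `m4` feed the skewness and kurtosis measures from the moments slide.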
More on Gaussian Distribution
The Gaussian Distribution Function
Sometimes called “Normal” or “bell shaped”
Perhaps the most used distribution in all of science
Is fully defined by two parameters, the mean μ and the standard deviation σ
About 95% of the area lies within 2σ of the mean
Normal distributions range from minus infinity to plus infinity
  f(x) = (1/(σ√(2π)))·e^(−(x − μ)²/(2σ²)) = N(μ, σ²)
  [Figure: Gaussian density with μ = 0 and σ = 1, peak value ≈ 0.4 at x = 0; a shaded area under the curve gives a probability]
More on Gaussian Distribution (Cont’d)
[Figure: normal distributions with the same variance but different means]
[Figure: normal distributions with the same mean but different variances]
More on Gaussian Distribution (Cont’d)
Standard Normal Distribution
A normal distribution whose mean is zero and standard deviation is one is called the
standard normal distribution.
  f(x) = (1/√(2π))·e^(−x²/2)
Any normal distribution can be converted to a standard normal distribution with simple algebra; this makes calculations much easier:
  Z = (X − μ)/σ
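Standardization is what makes a single table (or the error function) usable for every N(μ, σ²). A sketch with illustrative numbers (μ = 170, σ = 10 are my example values, not from the slides), using the standard identity Φ(z) = ½(1 + erf(z/√2)):

```python
import math

def phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example: X ~ N(mu=170, sigma^2=10^2); what is P(X <= 180)?
mu, sigma = 170.0, 10.0
z = (180.0 - mu) / sigma   # standardize: Z = (X - mu) / sigma
print(z)                   # 1.0
print(round(phi(z), 4))    # P(Z <= 1) ≈ 0.8413
```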
Central Limit Theorem
Why is the Gaussian PDF so applicable? The Central Limit Theorem.
Illustrating CLT
a) 5000 Random Numbers
b) 5000 Pairs (r1+r2) of Random Numbers
c) 5000 Triplets (r1+r2+r3) of Random Numbers
d) 5000 12-plets (r1+r2+…+r12) of Random Numbers
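Step (d) above can be reproduced in a few lines. Summing 12 independent Uniform(0,1) numbers gives a variable with mean 12·(1/2) = 6 and variance 12·(1/12) = 1, and its histogram is already very close to Gaussian (a minimal sketch; the seed is mine, for reproducibility):

```python
import random
import statistics

random.seed(0)

# 5000 sums of 12 Uniform(0,1) values: theoretical mean 6, variance 1
sums = [sum(random.random() for _ in range(12)) for _ in range(5000)]
m = statistics.mean(sums)
s = statistics.stdev(sums)
print(round(m, 1))  # should be close to 6.0
print(round(s, 1))  # should be close to 1.0
```

This particular construction (sum of 12 uniforms minus 6) is a classic quick-and-dirty approximate N(0, 1) generator.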
Central Limit Theorem
The Central Limit Theorem says that
we can use the standard normal approximation for the sampling distribution of the
sample mean regardless of the underlying distribution of the data (provided its variance
is finite). We don't need to know the underlying probability distribution of the data to
make use of sample statistics.
This only holds in samples which are large enough to use the CLT’s “large
sample properties.” So, how large is large enough?
Some books will tell you 30 is large enough.
Note: The CLT does not say: “in large samples the data is distributed normally.”
Two Normal-like distributions
T-Student Distribution
  f(t) = (Γ((v + 1)/2) / (√(vπ)·Γ(v/2)))·(1 + t²/v)^(−(v + 1)/2)
Chi-Squared Distribution
  f(χ²) = (1/(Γ(v/2)·2^(v/2)))·(χ²)^((v/2) − 1)·e^(−χ²/2)
Distances
Each clustering problem is based on some kind of “distance” between points.
A Euclidean space has some number of real-valued dimensions and “dense” points.
There is a notion of “average” of two points.
A Euclidean distance is based on the locations of points in such a space.
A Non-Euclidean distance is based on properties of points, but not their “location” in a
space.
Distance Matrices
Once a distance measure is defined, we can calculate the distance between objects.
These objects could be individual observations, groups of observations (samples) or
populations of observations.
For N objects, we then have a symmetric distance matrix D whose elements are the
distances between objects i and j.
Axioms of a Distance Measure
d is a distance measure if it is a function from pairs of points to reals such
that:
1. d(x,y) ≥ 0.
2. d(x,y) = 0 iff x = y.
3. d(x,y) = d(y,x).
4. d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality).
For the purpose of clustering, sometimes the distance (similarity) is not
required to be a metric
No Triangle Inequality
No Symmetry
Distance Measures
Euclidean Distance
L2 Norm
L1, L∞ Norm
Mahalanobis Distance
Euclidean Distance (L2 Norm)
A possible distance measure for spaces equipped with a Euclidean metric
  d_ij = √( Σₖ₌₁ᵖ (x_ik − x_jk)² )
  [Figure: two points (x_i1, x_i2) and (x_j1, x_j2) in the (x₁, x₂) plane joined by the distance d_ij]
Multivariate distances between populations
Calculates the Euclidean distance between two “points” defined by the
multivariate means of two populations based on p variables.
Does not take into account differences among populations in within-population
variability, or correlations among the variables.
Euclidean Distance (L2 Norm) (Cont’d)
Multivariate distances between populations
  d_ij = √( Σₖ₌₁ᵖ (μ_ik − μ_jk)² )
  [Figure: Sample 1 and Sample 2 in the (X₁, X₂) plane; d₁₂ is the distance between their mean points (μ₁₁, μ₂₁) and (μ₁₂, μ₂₂)]
Euclidean Distance (L1, L∞ Norm)
L1 Norm: sum of the differences in each dimension.
Manhattan distance = distance if you had to travel along coordinates only.
L∞ norm : d(x,y) = the maximum of the differences between x and y in
any dimension.
Example: x = (5,5), y = (9,8)
  L1-norm: dist(x,y) = 4 + 3 = 7
  L∞-norm: dist(x,y) = max(4, 3) = 4
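The three norms are one-liners; a minimal sketch (my function names) using the example point pair x = (5,5), y = (9,8):

```python
import math

def l1(x, y):
    # Manhattan distance: sum of the per-dimension differences
    return sum(abs(a - b) for a, b in zip(x, y))

def l2(x, y):
    # Euclidean distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linf(x, y):
    # maximum per-dimension difference
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (5, 5), (9, 8)
print(l1(x, y))    # 4 + 3 = 7
print(l2(x, y))    # sqrt(16 + 9) = 5.0
print(linf(x, y))  # max(4, 3) = 4
```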
Mahalanobis Distance
Distances are computed based on means, variances and covariances for
each of g samples (populations) based on p variables.
Mahalanobis distance “weights” the contribution of each pair of variables
by the inverse of their covariance.
  D²ᵢⱼ = (μᵢ − μⱼ)ᵗ·S⁻¹·(μᵢ − μⱼ)
μᵢ, μⱼ are the sample mean vectors.
S is the (pooled) covariance matrix of the samples.
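For the 2-D case the formula can be coded directly, inverting S by hand. A sketch (the means and covariances are illustrative values of mine); with S = I the Mahalanobis distance reduces to the squared Euclidean distance, and a variable with larger variance contributes less:

```python
def mahalanobis2(mu_i, mu_j, S):
    # D^2 = (mu_i - mu_j)^T S^-1 (mu_i - mu_j), 2-D case with a hand-coded 2x2 inverse
    (a, b), (c, d) = S
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dm = [mu_i[0] - mu_j[0], mu_i[1] - mu_j[1]]
    return sum(dm[p] * inv[p][q] * dm[q] for p in range(2) for q in range(2))

mu_i, mu_j = [1.0, 2.0], [4.0, 6.0]

# With S = identity, this is the squared Euclidean distance: 3^2 + 4^2 = 25
print(round(mahalanobis2(mu_i, mu_j, [[1.0, 0.0], [0.0, 1.0]]), 6))  # 25.0
# A variable with a large variance is down-weighted: 9/9 + 16/1 = 17
print(round(mahalanobis2(mu_i, mu_j, [[9.0, 0.0], [0.0, 1.0]]), 6))  # 17.0
```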