Transcript estimation
Samples and Statistical Inference
Motivation
• Sampling and statistical inference are used mainly for modeling the data set
• Why sample?
– Some data mining tasks involve only part of the population
– Even if a task involves the entire population, only a portion of it can be dealt with
• Why statistical inference?
– To make statements about the structure of the entire population (size estimation, probability distribution, our confidence in the statements, etc.)
– The statements are made based on samples
– We must have a model (hypothesis) in mind
– For example: Z = f(X, Y)
Statistical Inferences
• Samples must be drawn from the population randomly
– Each member of the population has a particular probability of appearing in the sample
– Purposive (non-random) sampling may stray from the true structure
• If the distribution is normal with mean 0 and SD 1, what do you get by putting many values of 20 or more in the sample?
• Let M be the model and θ be the parameter (neither is known to us), and let D = {x(1), …, x(n)} be a random sample (note: x(i) is the ith row)
– p(D|θ, M) is the probability that D will be selected
– If the x(i)'s are independent, then p(D|θ, M) = ∏i=1..n p(x(i)|θ, M)
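As a hedged illustration of the independence product above (the per-row density, the sample values, and the names row_density and sample_likelihood are placeholders, not from the slides): the sample likelihood is obtained by multiplying the per-row probabilities p(x(i)|θ, M).

import math

def row_density(x, theta):
    # Placeholder model M: each row is N(theta, 1); swap in any density p(x | theta, M)
    return math.exp(-(x - theta) ** 2 / 2) / math.sqrt(2 * math.pi)

def sample_likelihood(D, theta):
    # p(D | theta, M) = product over rows of p(x(i) | theta, M), assuming independent rows
    p = 1.0
    for x in D:
        p *= row_density(x, theta)
    return p

D = [0.5, -0.2, 1.3]                 # hypothetical random sample
print(sample_likelihood(D, 0.0))     # likelihood of D under theta = 0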
Statistical Inferences
– p(D|θ, M) is called the likelihood function when viewed as a function of θ
– Reject a model (parameter) if the probability of the observed data arising under that model (parameter) is less than a threshold
• Example:
– Suppose we have determined that the data follow a normal distribution with unit variance, but we do not know its mean
– Calculate the probability of the observed data for each candidate value of the mean
– Reject those values for which the probability is less than the threshold
– Or, more frequently, select the value that maximizes the probability; this is a special case of estimation
• In the above example, we could choose the value of the mean that maximizes the probability
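A hedged sketch of this procedure (the sample values, the candidate grid, and the rejection threshold are invented for illustration): evaluate the likelihood over a grid of candidate means, reject candidates whose likelihood falls below the threshold, and keep the candidate that maximizes it.

import math

def likelihood(data, theta):
    # Likelihood of independent N(theta, 1) observations
    p = 1.0
    for x in data:
        p *= math.exp(-(x - theta) ** 2 / 2) / math.sqrt(2 * math.pi)
    return p

D = [0.3, -0.8, 1.1, 0.2]                       # hypothetical sample
candidates = [i / 10 for i in range(-30, 31)]   # candidate means -3.0, -2.9, ..., 3.0
threshold = 1e-6                                # illustrative rejection threshold

scores = {m: likelihood(D, m) for m in candidates}
accepted = [m for m in candidates if scores[m] >= threshold]
best = max(scores, key=scores.get)              # mean value maximizing the likelihood
print(accepted, best)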
Estimation
• Purpose: making statements about the entire data set
• Approach: estimate the parameter(s) using samples
• Notation
– Let θ be the parameter to be estimated
– The estimator of θ, denoted θ̂, is a function of the observational data in the sample
– Example 1: estimate the average battery life of a given type
• Let μ denote the (true) average
• We randomly take n batteries of that type and observe their lives, denoted x1, …, xn
• A reasonable estimator: μ̂ = x̄ = (x1 + … + xn)/n
Estimation
• Let n = 3, and x1 = 5.2, x2 = 5.6, x3 = 6.0
• Then we have μ̂ = 5.6
– Example 2: Assume the breakdown voltage of a device has a normal distribution with mean μ. Let x1, …, xn be the observed breakdown voltages in the sample
– Possible estimators:
• x̄ : the average of all the values
• x̃ : the average of the middle two values
• [min(xi) + max(xi)]/2
• The 10% trimmed mean (discard the smallest and largest 10% of the values, then average the rest)
Estimation (cont.)
– Let n = 10, and the sample be:
24.46 25.61 26.25 26.42 26.66
27.15 27.31 27.54 27.74 27.94
– We have:
• x̄ = (Σxi)/n = 267.08/10 ≈ 26.71
• x̃ = (26.66 + 27.15)/2 ≈ 26.91
• [min(xi) + max(xi)]/2 = (24.46 + 27.94)/2 = 26.20
• x̄tr(10) = (267.08 − 24.46 − 27.94)/8 ≈ 26.84
– Question: which estimator is better?
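A hedged Python sketch that reproduces the four estimates above (for n = 10, the 10% trimming drops exactly one value from each end):

xs = sorted([24.46, 25.61, 26.25, 26.42, 26.66,
             27.15, 27.31, 27.54, 27.74, 27.94])
n = len(xs)

mean = sum(xs) / n                          # x-bar             ~ 26.71
median = (xs[n // 2 - 1] + xs[n // 2]) / 2  # middle two values ~ 26.91
midrange = (xs[0] + xs[-1]) / 2             # (min + max) / 2   = 26.20
trimmed = sum(xs[1:-1]) / (n - 2)           # 10% trimmed mean  ~ 26.84

print(mean, median, midrange, trimmed)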
Estimation (cont.)
– Example 3: assume X is the reaction time to some stimulus and has a uniform distribution on [0, θ]. Estimate θ using a random sample x1, …, xn
• Estimator 1: θ̂1 = max(x1, …, xn)
• Estimator 2: θ̂2 = ((n + 1)/n) · max(x1, …, xn)
• Let n = 5, and the sample be 4.2, 1.7, 2.4, 3.9, 1.3; then θ̂1 = 4.2 and θ̂2 = (6/5) × 4.2 = 5.04
• Which one is more reasonable?
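A quick hedged check of the two estimators on this sample (plain Python, nothing beyond the slide's formulas):

xs = [4.2, 1.7, 2.4, 3.9, 1.3]
n = len(xs)

theta1 = max(xs)                    # estimator 1: sample maximum = 4.2
theta2 = (n + 1) / n * max(xs)      # estimator 2: (n + 1)/n * maximum = 5.04
print(theta1, theta2)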
Estimation (cont.)
– Example 4: let the random sample x1, …, xn be drawn from a population with a certain probability distribution. Estimate σ²
• Estimator 1: σ̂1² = Σi=1..n (Xi − X̄)² / n
• Estimator 2: σ̂2² = Σi=1..n (Xi − X̄)² / (n − 1)
• Let n = 8, and the sample be 44.2, 43.9, 44.7, 44.2, 44.0, 43.8, 44.6, 43.1; then σ̂1² ≈ 0.220 and σ̂2² ≈ 0.251
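A short Python check of the two variance estimators on this sample:

xs = [44.2, 43.9, 44.7, 44.2, 44.0, 43.8, 44.6, 43.1]
n = len(xs)
xbar = sum(xs) / n

ss = sum((x - xbar) ** 2 for x in xs)   # sum of squared deviations
var_biased = ss / n                     # estimator 1  ~ 0.220
var_unbiased = ss / (n - 1)             # estimator 2  ~ 0.251
print(var_biased, var_unbiased)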
Estimation (cont.)
– Example 5: Let the sample (x1ᵗ, …, xnᵗ) be drawn from a population of 2D objects with a certain (unknown) distribution. Estimate the mean and the covariance matrix.
• Note: we have xiᵗ = <xi1, xi2> for all i
• Estimator for the mean: μ̂ᵗ = <ȳ1, ȳ2>, where ȳ1 = Σi=1..n xi1 / n and ȳ2 = Σi=1..n xi2 / n
• Estimator for the covariance matrix:
S² = [ σ11  σ12 ]
     [ σ21  σ22 ]
where, for i = 1, 2, σii = Σj=1..n (xji − ȳi)² / n, and σ12 = σ21 = Σi=1..n (xi1 − ȳ1)(xi2 − ȳ2) / n
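A hedged sketch using NumPy, assuming it is available (the 2D data points are made up for illustration); note that np.cov divides by n − 1 by default, so bias=True is passed to match the 1/n estimator above.

import numpy as np

# Hypothetical 2-D sample: each row is one object <x_i1, x_i2>
X = np.array([[1.0, 2.1],
              [0.8, 1.9],
              [1.3, 2.4],
              [1.1, 2.0]])

mu_hat = X.mean(axis=0)                 # <y1-bar, y2-bar>
S = np.cov(X, rowvar=False, bias=True)  # 2x2 covariance matrix with 1/n normalization
print(mu_hat)
print(S)                                # S[0,0]=sigma11, S[0,1]=S[1,0]=sigma12, S[1,1]=sigma22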
Estimation (cont.)
– Let X be any random sample containing n observations; then X is a (multivariate) random variable
– To emphasize this, we write X = (X1, …, Xn)
– Any estimator f(X) is a function of the n random variables X1, …, Xn, and is therefore itself a random variable
– Thus when we define an estimator, we should use the notation f(X) = f(X1, …, Xn)
• For instance, in Example 1, we write μ̂ = X̄ = ΣXi / n
• The same holds for the other examples
Estimation (cont.)
• Let θ̂ be an estimator for parameter θ. We say that E(θ̂) − θ is the bias of θ̂
• If an estimator θ̂ has bias 0 for every possible value of θ, then it is said to be unbiased
• Observation: an unbiased estimator does not show systematic departure from the parameter it estimates
• But for any particular sample, the estimate θ̂ may be far away from the true value of θ
Estimation (cont.)
• Example 6: consider Example 1
– μ̂ = X̄ = (X1 + … + Xn)/n
– We have E(μ̂) = E((X1 + … + Xn)/n) = (E(X1) + … + E(Xn))/n = nμ/n = μ
– Thus E(μ̂) − μ = 0 for any μ. So μ̂ is unbiased
• Example 7: consider Example 3
– Let the density of θ̂1 be f(x), where x ∈ [0, θ]
– We can show that f(x) = n x^(n−1)/θ^n. Thus E(θ̂1) = ∫[0,θ] x · n x^(n−1)/θ^n dx = nθ/(n + 1)
Estimation (cont.)
– Since E(θ̂1) − θ = −θ/(n + 1) ≠ 0, θ̂1 is not unbiased
– But θ̂2 is unbiased (intuitively clear?)
• Principle 1: Among different estimators, select one
that is unbiased
• Recall Var(θ̂) = E[(θ̂ − E(θ̂))²]
– It measures how much our estimates will vary across different observed samples
– Note: unlike bias, the definition of variance does not involve the true value of the parameter
• Principle 2: among all the unbiased estimators,
choose one with minimum variance
Estimation (cont.)
• Example 8: consider Example 3 again
– Define estimator θ̂3 = 2X̄ = 2·ΣXi/n
– We can show that θ̂3 is also unbiased
– But Var(θ̂2) = θ²/[n(n + 2)] and Var(θ̂3) = θ²/(3n)
– When n > 1, θ²/[n(n + 2)] < θ²/(3n)
• In reality, bias and variance often compete, i.e., an estimator with a small bias may have a higher variance
• How to balance the two is a central issue in data mining tasks
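A hedged simulation sketch of Examples 7 and 8 (the true θ, the sample size, and the trial count are chosen arbitrarily): θ̂1 comes out biased low, θ̂2 and θ̂3 are roughly unbiased, and θ̂2 has the smaller variance when n > 1.

import random

theta, n, trials = 10.0, 5, 100_000
est1, est2, est3 = [], [], []
for _ in range(trials):
    xs = [random.uniform(0, theta) for _ in range(n)]
    m = max(xs)
    est1.append(m)                         # theta1-hat = max
    est2.append((n + 1) / n * m)           # theta2-hat = (n+1)/n * max
    est3.append(2 * sum(xs) / n)           # theta3-hat = 2 * sample mean

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

for name, est in [("theta1", est1), ("theta2", est2), ("theta3", est3)]:
    print(name, "bias ~", mean(est) - theta, "variance ~", var(est))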
Maximum Likelihood Estimation
• Consider the probability p(D|θ, M)
• Suppose the model structure M is determined; we can then write it as p(D|θ)
• When viewed as a function of θ, write it as L(θ|D)
– Interpretation: for any given sample D, each value of θ gives a probability that D will arise
• Maximum likelihood estimator (MLE): for each D, assign to θ the value that maximizes the likelihood function
– Intuition: the MLE gives D the highest probability of being selected
MLE
• Example 9: estimate the proportion of the customers that purchase milk in a supermarket, based on a sample x1, …, xn. Assume the number of customers purchasing milk follows a binomial distribution
– The proportion of customers that purchase milk is the probability that a random customer purchases milk; call it θ
– Assume that among the n customers, r purchase milk
– Thus L(θ|n, r) = C(n, r) θ^r (1 − θ)^(n−r)
– All we need is to maximize θ^r (1 − θ)^(n−r)
MLE
– Differentiating with respect to θ and setting the result to zero, we get
rθ^(r−1)(1 − θ)^(n−r) − (n − r)θ^r(1 − θ)^(n−r−1) = 0, i.e., r(1 − θ) − (n − r)θ = 0
Thus θ = r/n
– The proportion purchasing milk in the sample = the maximum likelihood estimate of the proportion purchasing milk in the entire customer population!
– (Note: we can say the above only under the binomial model)
– Consider the following three scenarios: n = 10, r = 7; n = 100, r = 70; n = 1000, r = 700
[Plots of L(θ) for θ from 0 to 1 in each scenario: every curve peaks at θ = 0.7, and the peak becomes sharper as n increases]
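A hedged sketch that reproduces the three scenarios numerically (the grid resolution is arbitrary): on the grid, the likelihood always peaks at θ = r/n = 0.7.

from math import comb

def L(theta, n, r):
    # Binomial likelihood: C(n, r) * theta^r * (1 - theta)^(n - r)
    return comb(n, r) * theta ** r * (1 - theta) ** (n - r)

grid = [i / 100 for i in range(1, 100)]
for n, r in [(10, 7), (100, 70), (1000, 700)]:
    best = max(grid, key=lambda t: L(t, n, r))
    print(n, r, "MLE on grid:", best)     # 0.7 in each case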
MLE (cont.)
• Example 10: assume x1, …, xn have arisen independently from a normal distribution with unit variance. Use MLE to estimate the mean θ
– The density of each xi is (1/√(2π)) e^(−(xi − θ)²/2)
– Thus L(θ|x1, …, xn) = ∏i=1..n (1/√(2π)) e^(−(xi − θ)²/2) = (2π)^(−n/2) e^(−(1/2) Σi=1..n (xi − θ)²)
– Take the derivative with respect to θ, set it to zero and simplify; we get Σi (xi − θ) = 0
– Thus θ = Σi xi / n (refer to Example 1)
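A hedged numerical check (the grid range and resolution are arbitrary, and the log-likelihood is maximized instead of the likelihood to avoid underflow): the maximizer agrees with the sample mean, as Example 1 suggests.

import math

def log_likelihood(data, theta):
    # log L(theta | x1..xn) for independent N(theta, 1) observations
    n = len(data)
    return -n / 2 * math.log(2 * math.pi) - sum((x - theta) ** 2 for x in data) / 2

xs = [5.2, 5.6, 6.0]                       # the sample from Example 1
grid = [i / 1000 for i in range(4000, 7001)]
best = max(grid, key=lambda t: log_likelihood(xs, t))
print(best, sum(xs) / len(xs))             # both ~ 5.6: the MLE equals the sample mean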
MLE (cont.)
• Example 11: Let X and Y be random variables related by Y = a + bX + e, where a and b are constants and e is a random variable with a normal distribution with mean 0 and variance σ². Use MLE to estimate a and b.
– Let D = {(x1, y1), …, (xn, yn)} be the sample. We have, for all i, yi = a + bxi + ei, or ei = yi − (a + bxi)
– L(a, b|D) = ∏i=1..n (1/(σ√(2π))) e^(−0.5((yi − (a + bxi))/σ)²) = (2πσ²)^(−n/2) e^(−(0.5/σ²) Σi=1..n (yi − (a + bxi))²)
MLE (cont.)
• Note: maximizing (2πσ²)^(−n/2) e^(−(0.5/σ²) Σi=1..n (yi − (a + bxi))²) is equivalent to minimizing Σi=1..n (yi − (a + bxi))²
• This is called the least squares method
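A hedged sketch of the least squares fit (the (x, y) pairs are made up); the closed-form slope and intercept below come from setting the derivatives of the sum of squares to zero, which matches maximizing the likelihood above.

# Hypothetical sample of (x, y) pairs
pts = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1), (5.0, 9.8)]
n = len(pts)

xbar = sum(x for x, _ in pts) / n
ybar = sum(y for _, y in pts) / n

# Least squares estimates: minimize sum (y_i - (a + b x_i))^2
b = sum((x - xbar) * (y - ybar) for x, y in pts) / sum((x - xbar) ** 2 for x, _ in pts)
a = ybar - b * xbar
print(a, b)   # intercept and slope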
Bayesian Estimation
• Until now, our estimation is based on the following:
– Parameters of the population are fixed
– The sample is randomly drawn from that population, and therefore
has a probability distribution
– We search for values of the parameters that best accommodate the distribution of the sample
• Bayesian estimation takes a different view:
– The sample is fixed (after all, it has been observed)
– The parameters are random variables that have a probability distribution
– We use the sample values to gain information about the
distributions of the parameters
Bayesian Estimation (cont.)
• Basic approach
– Before the sample data are analyzed, there is initial knowledge about the distribution p(θ) of the parameter θ, called the prior distribution
– Analysis of the sample D leads to a modification of the prior distribution, called the posterior distribution, p(θ|D)
– The modification is based on Bayes' theorem
• Bayes' theorem: p(θ|D) = p(D|θ) p(θ) / p(D)
• (Examples in Bayes classifications in later chapters)
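As a hedged illustration of the Bayesian view (the uniform prior, the grid of candidate θ values, and the observed counts are invented, and a discrete grid stands in for a continuous prior): the posterior over the milk-purchase proportion θ from Example 9 is proportional to p(D|θ) p(θ), normalized by p(D).

from math import comb

n, r = 10, 7                                  # observed sample: 7 of 10 customers buy milk
grid = [i / 100 for i in range(1, 100)]       # candidate values of theta
prior = {t: 1 / len(grid) for t in grid}      # uniform prior p(theta)

# Posterior p(theta | D) = p(D | theta) * p(theta) / p(D)
unnorm = {t: comb(n, r) * t ** r * (1 - t) ** (n - r) * prior[t] for t in grid}
p_D = sum(unnorm.values())
posterior = {t: v / p_D for t, v in unnorm.items()}

print(max(posterior, key=posterior.get))      # posterior mode ~ 0.7 on this grid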
Mixtures of Models
• A dataset may be better described by more
than one distribution
• A mixture density for attribute x is defined as
f(x) = Σk=1..K πk fk(x|θk)
– The data set is subdivided into K classes (or components)
– The kth class has a density of fk(x|θk)
– πk is the probability that a randomly selected object belongs to the kth class
– Note: Σk πk = 1
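A hedged sketch of a two-component normal mixture (the component weights, means, and standard deviations are made up): the density at any x is the weighted sum of the component densities, with the weights summing to 1.

import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * math.sqrt(2 * math.pi))

# K = 2 components: (weight pi_k, mean, standard deviation)
components = [(0.3, 0.0, 1.0), (0.7, 5.0, 2.0)]
assert abs(sum(w for w, _, _ in components) - 1.0) < 1e-9   # weights sum to 1

def mixture_density(x):
    # f(x) = sum over k of pi_k * f_k(x | theta_k)
    return sum(w * normal_pdf(x, mu, s) for w, mu, s in components)

print(mixture_density(0.0), mixture_density(5.0))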
Mixtures of Models (cont.)
• Interpretation for mixtures of normal distributions
– A single normal distribution can be thought of as being located at its mean
– Its variance (covariance matrix) determines its shape
– The components of the dataset thus have different locations and shapes
• Example 12
– A 1D (or 2D) data set consists of three classes
– Each class has a normal distribution with a certain weight
Data set
[Figure: the data set with three normal densities; each contour is 3σ away from the mean of one density; also shown are contours of the overall probability distribution]
Mixtures of Models (cont.)
• Example 13
– The entire population of customers for a supermarket can be grouped into different types, such as Asian and English
– Different types of customers have different purchasing characteristics
• People in the Asian group are more likely to buy rice, tea, and vegetables (such as bok choy and napa cabbage) than the English group
• People in the English group are more likely to buy potatoes, coffee, and cheese than the Asian group
– The probability distribution of the buying habits can be better represented by a mixture model of two densities, one for each group