Data analysis and uncertainty
Outline
• Random Variables
• Estimate
• Sampling
Introduction
• Reasons for Uncertainty
– Prediction
• Making a prediction about tomorrow based on data we have today
– Sampling
• Data may be a sample from a population, and we don't know how our sample differs from other samples (or from the population)
– Missing or unknown values
• We need to guess these values
• Example: censored data
Introduction
• Dealing with Uncertainty
– Probability
– Fuzzy logic
• Probability Theory vs. Probability Calculus
– Probability Theory
• Mapping from the real world to a mathematical representation
– Probability Calculus
• Based on well-defined and generally accepted axioms
• The aim is to explore the consequences of those axioms
Introduction
• Frequentist (probability is objective)
– The probability of an event is defined as the limiting proportion of times that the event would occur in identical situations
– Examples
• The proportion of times a head comes up when tossing the same coin repeatedly
• Assessing the probability that a supermarket customer will buy a certain item (using similar customers)
Introduction
• Bayesian (probability is subjective)
– Explicit characterization of all uncertainty, including any parameters estimated from the data
– Probability is an individual's degree of belief that a given event will occur
• Frequentist vs. Bayesian
– Toss a coin 10 times and get 7 heads
– A frequentist estimates P(head) = 7/10
– A Bayesian starts from a prior belief, say P(head) = 0.5, then combines that prior with the data to estimate the probability
Random variable
• A mapping from a property of objects to a variable that can take a set of possible values, via a process that appears to the observer to have an element of unpredictability
• Examples
– Coin toss (domain is the set {heads, tails})
– Number of times a coin has to be tossed to get a head
• Domain is the positive integers
– A student's score
• Domain is the set of integers between 0 and 100
Properties of single random variable
• X is a random variable and x is its value
• If the domain is finite:
– probability mass function p(x)
• If the domain is the real line:
– probability density function f(x)
• Expectation of X
– Discrete: $E[X] = \sum_i x_i \, p(x_i)$
– Continuous: $E[X] = \int x \, f(x) \, dx$
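As a quick illustration (a sketch added here, not part of the original slides), the discrete expectation formula can be computed directly, using a fair die as the example:

```python
# Expectation of a discrete random variable: E[X] = sum_i x_i * p(x_i)
# Example: a fair six-sided die, where each face has probability 1/6.

values = [1, 2, 3, 4, 5, 6]
pmf = {x: 1 / 6 for x in values}  # probability mass function p(x)

expectation = sum(x * pmf[x] for x in values)
print(expectation)  # 3.5
```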
Multivariate random variable
• Set of several random variables
• For a p-dimensional vector x = (x1, ..., xp)
• The joint mass function is
$p(X_1 = x_1, \ldots, X_p = x_p) = p(x_1, \ldots, x_p)$
The joint mass function
• For example
– Roll two fair dice; let X be the first die's result and Y the second's
– Then p(X=3, Y=3) = 1/6 · 1/6 = 1/36
The joint mass function
        X=1    X=2    X=3    X=4    X=5    X=6    P_Y(y)
Y=1     1/36   1/36   1/36   1/36   1/36   1/36   0.17
Y=2     1/36   1/36   1/36   1/36   1/36   1/36   0.17
Y=3     1/36   1/36   1/36   1/36   1/36   1/36   0.17
Y=4     1/36   1/36   1/36   1/36   1/36   1/36   0.17
Y=5     1/36   1/36   1/36   1/36   1/36   1/36   0.17
Y=6     1/36   1/36   1/36   1/36   1/36   1/36   0.17
P_X(x)  0.17   0.17   0.17   0.17   0.17   0.17   1.00
– Row sums give the marginal of Y; column sums give the marginal of X (0.17 ≈ 1/6)
Marginal probability mass function
• The marginal probability mass functions of X and Y are
$f_X(x) = p(X = x) = \sum_y p_{XY}(x, y)$
$f_Y(y) = p(Y = y) = \sum_x p_{XY}(x, y)$
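To make this concrete, here is a small illustrative numpy sketch (added, not from the original slides) that builds the two-dice joint table above and recovers both marginals by summing over the other variable:

```python
import numpy as np

# Joint pmf of two fair dice: a 6x6 table where every entry is 1/36.
joint = np.full((6, 6), 1 / 36)  # rows index Y=1..6, columns index X=1..6

p_x = joint.sum(axis=0)  # marginal of X: sum over y for each column
p_y = joint.sum(axis=1)  # marginal of Y: sum over x for each row

print(p_x)          # each entry is 1/6, approximately 0.17
print(p_y)          # same by symmetry
print(joint.sum())  # 1.0, as a sanity check
```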
Continuous
• The marginal probability density functions of X and Y are
$f_X(x) = \int f_{XY}(x, y) \, dy$
$f_Y(y) = \int f_{XY}(x, y) \, dx$
Conditional probability
• The density of a single variable (or a subset of the complete set of variables) given (or "conditioned on") particular values of the other variables
• The conditional density of X given some value of Y is denoted f(x|y) and defined as
$f(x \mid y) = \frac{f(x, y)}{f(y)}$
Conditional probability
• For example
– A student's score is drawn at random
– The sample space is S = {0, 1, ..., 100}
– What is the probability that the student fails (score below 60)?
$P(F) = \frac{|F|}{|S|} = \frac{60}{101}$
– Given that the student's score is even (including 0), what is the probability that the student fails?
• |E| = 51 and |E ∩ F| = 30, so
$P(F \mid E) = \frac{P(E \cap F)}{P(E)} = \frac{30/101}{51/101} = \frac{30}{51}$
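A brute-force check by enumeration (an illustrative sketch, not from the original slides), assuming "fail" means a score below 60:

```python
from fractions import Fraction

scores = range(101)                      # sample space S = {0, 1, ..., 100}
F = {s for s in scores if s < 60}        # fail: score below 60
E = {s for s in scores if s % 2 == 0}    # even score, including 0

p_F = Fraction(len(F), 101)
p_F_given_E = Fraction(len(E & F), len(E))

print(p_F)          # 60/101
print(p_F_given_E)  # 30/51, which reduces to 10/17
```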
Supermarket data [figure]
Conditional independence
• Generic problem in data mining is finding
relationships between variables
– Is purchasing item A likely to be related to
purchasing item B?
• Variables are independent if there is no
relationship; otherwise they are dependent
• Independent if p(x, y) = p(x) p(y)
• If $p(A \cap B) = p(A)\,p(B)$, then
$p(A \mid B) = \frac{p(A \cap B)}{p(B)} = \frac{p(A)\,p(B)}{p(B)} = p(A)$
and similarly
$p(B \mid A) = \frac{p(A \cap B)}{p(A)} = \frac{p(A)\,p(B)}{p(A)} = p(B)$
Conditional Independence: More than 2 variables
• X is conditionally independent of Y given Z if, for all values of X, Y, Z,
$p(x, y \mid z) = p(x \mid z)\, p(y \mid z)$
Conditional Independence: More than 2 variables
• Example
– P(F) = 60/101 and P(E) = 51/101, but P(E ∩ F) = 30/101 ≠ P(E) P(F)
– So E and F are dependent
– Now condition on B = {score ≠ 100}; then
• P(F | B) = 60/100
• P(E | B) = 50/100 = 1/2
• P(E ∩ F | B) = 30/100 = (60/100) × (1/2)
• Given the condition B, E and F are independent
Conditional Independence: More than 2 variables
• Example
– If instead we condition on C = {score = 100}, then
• P(F | C) = 0
• P(E | C) = 1
• P(E ∩ F | C) = 0 = 1 × 0
• Given the condition C, E and F are independent
– Now we can calculate P(E ∩ F) by total probability:
$p(E \cap F) = p(E \cap F \mid \text{score} \neq 100)\, p(\text{score} \neq 100) + p(E \cap F \mid \text{score} = 100)\, p(\text{score} = 100) = 0.5 \times 0.6 \times \frac{100}{101} + 0 \times \frac{1}{101} = \frac{30}{101}$
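The following illustrative Python check (added, not from the original slides) verifies the conditional-independence claim and the total-probability calculation by enumeration:

```python
from fractions import Fraction

S = set(range(101))
F = {s for s in S if s < 60}       # fail: score below 60
E = {s for s in S if s % 2 == 0}   # even score
B = {s for s in S if s != 100}     # score != 100

def p(event, given):
    """Conditional probability of `event` given the conditioning set `given`."""
    return Fraction(len(event & given), len(given))

# Conditional independence given B: p(E ∩ F | B) == p(E | B) * p(F | B)
assert p(E & F, B) == p(E, B) * p(F, B)

# Total probability over B and its complement C = {score == 100}
C = S - B
total = p(E & F, B) * Fraction(len(B), 101) + p(E & F, C) * Fraction(len(C), 101)
assert total == Fraction(30, 101)
print(total)  # 30/101
```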
Conditional Independence
• Conditional independence does not imply marginal independence:
$p(x, y \mid z) = p(x \mid z)\, p(y \mid z)$ does not imply $p(x, y) = p(x)\, p(y)$
• Note that X and Y may be unconditionally independent but conditionally dependent given Z
On assuming independence
• Independence is a strong assumption that is frequently violated in practice
• But it simplifies modeling
– Fewer parameters
– Understandable models
Dependence and Correlation
• Covariance measures how X and Y vary
together
– Large and positive if large X is associated with large Y, and small X with small Y
– Negative if large X is associated with small Y
• Two variables may be dependent but not linearly correlated
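A classic illustration of that last point (an added sketch, not from the original slides): if Y = X² for a symmetric X, then Y is completely determined by X, yet their linear correlation is essentially zero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)  # symmetric around zero
y = x ** 2                    # fully dependent on x, but not linearly

# Pearson correlation is (near) zero despite perfect dependence.
print(np.corrcoef(x, y)[0, 1])  # approximately 0
```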
Correlation and Causation
• Two variables may be highly correlated without a causal relationship between them
– Yellow-stained fingers and lung cancer may be correlated, but they are causally linked only through a third variable: smoking
– Human reaction time and earned income are negatively correlated
• That does not mean one causes the other
• A third variable, age, is causally related to both
Samples and Statistical inference
• Samples can be used to model the data
• If the goal is to detect small deviations in the data, the sample size will affect the result
Dual Role of Probability and Statistics in Data Analysis
Outline
• Random Variable
• Estimate
– Maximum Likelihood Estimation
– Bayesian Estimation
• Sampling
Estimation
• In inference we want to make statements about the entire population from which the sample is drawn
• Two important methods for estimating the parameters of a model:
– Maximum Likelihood Estimation
– Bayesian Estimation
Desirable properties of estimators
• Let $\hat{\theta}$ be an estimate of the parameter $\theta$
• Two measures of estimator quality:
– Bias: the difference between the expected and the true value
$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$
– Variance of the estimate
$\text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$
Mean squared error
• The mean of the squared difference between the value of the estimator and the true value of the parameter:
$E[(\hat{\theta} - \theta)^2]$
• Mean squared error can be partitioned as the sum of squared bias and variance:
$E[(\hat{\theta} - \theta)^2] = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})$
Mean squared error
$E[(\hat{\theta} - \theta)^2] = E[(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta)^2]$
$= E[(\hat{\theta} - E[\hat{\theta}])^2] + (E[\hat{\theta}] - \theta)^2 + 2\,E[(\hat{\theta} - E[\hat{\theta}])(E[\hat{\theta}] - \theta)]$
$= \text{Var}(\hat{\theta}) + \text{Bias}^2(\hat{\theta}) + 2\,(E[\hat{\theta}] - \theta)\,E[\hat{\theta} - E[\hat{\theta}]]$
$= \text{Var}(\hat{\theta}) + \text{Bias}^2(\hat{\theta})$
since $E[\hat{\theta} - E[\hat{\theta}]] = E[\hat{\theta}] - E[\hat{\theta}] = 0$.
This uses the expansion $(a - b + b - c)^2 = (a - b)^2 + (b - c)^2 + 2(a - b)(b - c)$ and the facts $E[E[X]] = E[X]$ and $E[c] = c$ where $c$ is a constant.
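As an illustrative numerical check (added, not from the original slides), the following sketch estimates bias, variance, and MSE of the sample mean by simulation and confirms that MSE = bias² + variance:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0            # true mean of the population
n, trials = 30, 20_000

# Repeatedly draw samples and compute the estimator (here, the sample mean).
estimates = rng.normal(loc=theta, scale=1.0, size=(trials, n)).mean(axis=1)

bias = estimates.mean() - theta
variance = estimates.var()
mse = ((estimates - theta) ** 2).mean()

print(mse, bias**2 + variance)  # the two values agree
```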
Maximum Likelihood Estimation
• The most widely used method for parameter estimation
• The likelihood function is the probability that the data D would have arisen for a given value of θ:
$L(\theta \mid D) = L(\theta \mid x(1), \ldots, x(n)) = p(x(1), \ldots, x(n) \mid \theta) = \prod_{i=1}^{n} p(x(i) \mid \theta)$
• The value of θ for which the data has the highest probability is the MLE
Example of MLE for Binomial
• Customers either purchase or do not purchase milk
– We want an estimate of the proportion purchasing
• Binomial with unknown parameter θ
• Samples x(1), ..., x(1000), of which r purchase milk
• Assuming conditional independence, the likelihood function is
$L(\theta \mid x(1), \ldots, x(1000)) = \prod_{i=1}^{1000} \theta^{x(i)} (1 - \theta)^{1 - x(i)} = \theta^{r} (1 - \theta)^{1000 - r}$
Log-likelihood Function
• Maximizing the likelihood directly is awkward, so we switch to the log-likelihood function
$l(\theta) = \log L(\theta) = r \log\theta + (1000 - r)\log(1 - \theta)$
• Then differentiate and set equal to zero:
$\frac{d}{d\theta}\left[r \log\theta + (1000 - r)\log(1 - \theta)\right] = \frac{r}{\theta} - \frac{1000 - r}{1 - \theta}$
• Setting $\frac{r}{\theta} - \frac{1000 - r}{1 - \theta} = 0$ gives $r(1 - \theta) = (1000 - r)\theta$, i.e. $r - r\theta = 1000\theta - r\theta$, so
$\hat{\theta} = \frac{r}{1000}$
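A small sketch (added here, not from the slides) that checks the closed-form MLE r/n against a direct numerical maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

r, n = 700, 1000  # r purchases observed out of n customers

def neg_log_likelihood(theta):
    # l(theta) = r log(theta) + (n - r) log(1 - theta), negated for minimization
    return -(r * np.log(theta) + (n - r) * np.log(1 - theta))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(result.x, r / n)  # both approximately 0.7
```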
Example of MLE for Binomial
• r milk purchases out of n customers
• θ is the probability that milk is purchased by a random customer
• For three data sets:
– r = 7, n = 10
– r = 70, n = 100
– r = 700, n = 1000
• The uncertainty becomes smaller as n increases
Example of MLE for Binomial [figure: likelihood functions for the three data sets]
Likelihood under Normal Distribution
• Unit variance, unknown mean θ
• Recall the normal density
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
• With σ² = 1, the likelihood function is
$L(\theta \mid x(1), \ldots, x(n)) = \prod_{i=1}^{n} (2\pi)^{-1/2} \exp\left(-\frac{1}{2}(x(i) - \theta)^2\right) = (2\pi)^{-n/2} \exp\left(-\frac{1}{2}\sum_{i=1}^{n}(x(i) - \theta)^2\right)$
Log-likelihood function
$l(\theta \mid x(1), \ldots, x(n)) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{n}(x(i) - \theta)^2$
• To find the MLE, set the derivative dl/dθ to zero:
$\frac{dl(\theta)}{d\theta} = \sum_{i=1}^{n}(x(i) - \theta)$
• Setting $\sum_{i=1}^{n}(x(i) - \theta) = 0$ gives $\sum_{i=1}^{n} x(i) - n\theta = 0$, so
$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} x(i)$
Likelihood under Normal Distribution
• θ is the estimated mean
• For two randomly generated data sets:
– 20 data points
– 200 data points
Likelihood under Normal Distribution [figure: likelihood functions for the two data sets]
Sufficient statistic
• A quantity s(D) is a sufficient statistic for θ if the likelihood l(θ) depends on the data only through s(D)
• No other statistic calculated from the same sample provides any additional information about the value of the parameter
Interval estimate
• A point estimate does not convey the uncertainty associated with it
• An interval estimate provides a confidence interval
Likelihood under Normal Distribution
Normal distribution:
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
Likelihood function:
$L(\mu, \sigma^2 \mid x_1, \ldots, x_n) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right)$
Log-likelihood function:
$l(\mu, \sigma^2 \mid x_1, \ldots, x_n) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$
Mean
$l(\mu \mid x_1, \ldots, x_n) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$
$\frac{d\,l(\mu \mid x_1, \ldots, x_n)}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0$
$\sum_{i=1}^{n}(x_i - \mu) = 0 \implies \sum_{i=1}^{n} x_i - n\mu = 0 \implies \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Variance
$l(\sigma^2 \mid x_1, \ldots, x_n) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$
$\frac{d\,l(\sigma^2 \mid x_1, \ldots, x_n)}{d\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0$
$\frac{n}{2\sigma^2} = \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 \implies \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$
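A brief illustrative check in Python (added, not from the slides) that these MLEs match the sample mean and the biased (divide-by-n) sample variance:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=200)

mu_hat = x.sum() / len(x)                        # MLE of the mean
sigma2_hat = ((x - mu_hat) ** 2).sum() / len(x)  # MLE of the variance (biased)

print(mu_hat, np.mean(x))     # identical
print(sigma2_hat, np.var(x))  # identical (np.var uses ddof=0 by default)
```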
Outline
• Random Variable
• Estimate
– Maximum Likelihood Estimation
– Bayesian Estimation
• Sampling
Bayesian approach
• Frequentist approach
– The parameters of the population are fixed but unknown
– The data is a random sample
– Intrinsic variability lies in the data $D = \{x(1), \ldots, x(n)\}$
• Bayesian approach
– The data are known
– The parameters θ are random variables
– θ has a distribution of values
– The prior $p(\theta)$ reflects our degree of belief about where the true parameters θ may be
Bayesian estimation
• Modification is done by Bayes' rule:
$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$
• This leads to a distribution rather than a single value
– A single value can be obtained from the mean or mode
Bayesian estimation
• p(D) is a constant independent of θ
• For a given data set D and a particular model (model = distributions for the prior and likelihood),
$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$
• If we have only a weak belief about the parameter before collecting the data, choose a wide prior (e.g., a normal with large variance)
Binomial example
• Single binary variable X: we wish to estimate θ = p(X = 1)
• The prior for a parameter in [0, 1] is the Beta distribution
$p(\theta) \propto \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$
where α > 0 and β > 0 are the two parameters of this model:
$\text{Beta}(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$
Binomial example
• Likelihood function
$L(\theta \mid D) = \theta^{r}(1 - \theta)^{n - r}$
• Combining likelihood and prior:
$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) \propto \theta^{r}(1 - \theta)^{n - r}\, \theta^{\alpha - 1}(1 - \theta)^{\beta - 1} = \theta^{r + \alpha - 1}(1 - \theta)^{n - r + \beta - 1}$
• We get another Beta distribution
– with parameters $r + \alpha$ and $n - r + \beta$
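A minimal sketch (added for illustration, not from the slides) of this conjugate Beta-binomial update using scipy's beta distribution; the prior parameters and data counts here are arbitrary choices:

```python
from scipy import stats

alpha, beta = 5, 5  # prior Beta(5, 5): weak belief centered on 0.5
r, n = 70, 100      # observed: r successes out of n trials

# Conjugate update: posterior is Beta(r + alpha, n - r + beta)
posterior = stats.beta(r + alpha, n - r + beta)

print(posterior.mean())          # posterior mean, approximately 0.68
print(posterior.interval(0.95))  # central 95% credible interval
```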
[Figures: Beta prior and posterior densities, e.g. Beta(5,5) and Beta(45,50)]
Advantages of Bayesian approach
• Retains full knowledge of all problem uncertainty
• Calculates the full posterior distribution on θ
• Natural updating of the distribution as new data arrive:
$p(\theta \mid D_1, D_2) \propto p(D_2 \mid \theta)\, p(D_1 \mid \theta)\, p(\theta)$
Predictive distribution
• In the equation that modifies the prior into the posterior,
$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$
the denominator is called the predictive distribution of D
• Useful for model checking
– If the observed data have only a small probability under it, the model is unlikely to be correct
Normal distribution example
• Suppose x comes from a normal distribution with unknown mean θ and known variance σ²
• The prior distribution for θ is $\theta \sim N(\mu_0, \sigma_0^2)$
Normal distribution example
$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta) \propto \exp\left(-\frac{1}{2\sigma^2}(x - \theta)^2\right)\exp\left(-\frac{1}{2\sigma_0^2}(\theta - \mu_0)^2\right)$
Collecting the terms in θ:
$\propto \exp\left(-\frac{1}{2}\left[\left(\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}\right)\theta^2 - 2\left(\frac{\mu_0}{\sigma_0^2} + \frac{x}{\sigma^2}\right)\theta\right]\right)$
Let
$\sigma_1^2 = \left(\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}\right)^{-1}, \qquad \mu_1 = \sigma_1^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{x}{\sigma^2}\right)$
Completing the square then gives
$p(\theta \mid x) \propto \exp\left(-\frac{(\theta - \mu_1)^2}{2\sigma_1^2}\right)$
so the posterior is again normal, $\theta \mid x \sim N(\mu_1, \sigma_1^2)$.
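An illustrative numeric version of this conjugate update (added, not from the slides), for a single observation x with made-up prior values:

```python
# Normal likelihood with known variance, normal prior on the mean.
mu0, sigma0_sq = 0.0, 4.0  # prior N(mu0, sigma0^2)
sigma_sq = 1.0             # known observation variance
x = 2.5                    # a single observed data point

# Posterior precision is the sum of prior and likelihood precisions.
sigma1_sq = 1.0 / (1.0 / sigma0_sq + 1.0 / sigma_sq)
mu1 = sigma1_sq * (mu0 / sigma0_sq + x / sigma_sq)

print(mu1, sigma1_sq)  # posterior mean 2.0, posterior variance 0.8
```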
Jeffrey’s prior
• A reference prior
• Fisher information:
$I(\theta \mid x) = -E\left[\frac{\partial^2 \log L(\theta \mid x)}{\partial \theta^2}\right]$
• Jeffreys prior:
$p(\theta) \propto \sqrt{I(\theta \mid x)}$
Conjugate priors
• p(θ) is a conjugate prior for p(x| θ) if the
posterior distribution p(θ|x) is in the same
family as the prior p(θ)
– Beta to Beta
– Normal distribution to Normal distribution
Outline
• Random Variable
• Estimate
– Maximum Likelihood Estimation
– Bayesian Estimation
• Sampling
Sampling in Data Mining
• The data set is often not ideally suited to statistical analysis
– "Experimental design" in statistics is concerned with optimal ways of collecting data
– Data miners usually cannot control the data collection process
– The data may be ideally suited to the purposes for which it was collected, but not adequate for its data mining uses
Sampling in Data Mining
• Two ways in which samples arise
– The database is a sample of the population
– The database contains every case, but the analysis is based on a sample
• Not appropriate when we want to find unusual records
Why sampling
• Draw a sample from the database that allows us to construct a model reflecting the structure of the data in the database
– Efficiency: quicker and easier
• The sample must be representative of the entire database
Systematic sampling
• Tries to ensure representativeness
– e.g., taking one out of every two records
• Can lead to problems when there are regularities in the database
– e.g., a data set where consecutive records belong to married couples
Random Sampling
• Avoids regularities
• EPSEM sampling
– Each record has the same probability of being chosen
Variance of Mean of Random Sample
• If the variance of a population of size N is σ², the variance of the mean of a simple random sample of size n drawn without replacement is
$\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)$
– Usually N >> n, so the second term is small, and the variance decreases as the sample size increases
Example
• 2000 points, population mean = 0.0108
• Random samples of n = 10, 100, 1000, each repeated 200 times
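A sketch of this kind of experiment (added, not from the slides); the population values here are synthetic stand-ins for the 2000 points described above:

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.normal(size=2000)  # synthetic stand-in population
print(population.mean())            # the "true" population mean

for n in (10, 100, 1000):
    # Draw 200 simple random samples of size n without replacement.
    means = [rng.choice(population, size=n, replace=False).mean()
             for _ in range(200)]
    # The spread of the sample means shrinks as n grows.
    print(n, np.std(means))
```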
Example [figure: spread of sample means for the three sample sizes]
Stratified Random Sampling
• Split the population into non-overlapping subpopulations, or strata
• Advantages
– Enables making statements about each of the subpopulations separately
• For example, one of the credit card companies we work with categorizes transactions into 26 categories: supermarket, gas station, and so on
Mean of Stratified Sample
• The total size of the population is N
• The k-th stratum has $N_k$ elements in it
• $n_k$ elements are chosen for the sample from this stratum
• The sample mean within the k-th stratum is $\bar{x}_k$
• Estimate of the population mean:
$\hat{\mu} = \frac{\sum_k N_k \bar{x}_k}{N}$
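A minimal illustrative sketch (added, not from the slides) of this stratified estimate, with made-up stratum sizes and sample means:

```python
# Hypothetical strata: (stratum size N_k, sample mean xbar_k within stratum)
strata = [(500, 2.0), (300, 5.0), (200, 9.0)]

N = sum(size for size, _ in strata)  # total population size
# Weight each stratum's sample mean by its share of the population.
mu_hat = sum(size * xbar for size, xbar in strata) / N

print(mu_hat)  # (500*2 + 300*5 + 200*9) / 1000 = 4.3
```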
Cluster Sampling
• Every cluster contains many elements
• A simple random sample of individual elements is not appropriate
• Select clusters, not individual elements