DATA MINING
from data to information
Ronald Westra
Dept. of Mathematics
Maastricht University
1
2
DATA ANALYSIS AND
UNCERTAINTY
Data Mining Lecture V
[Chapter 4, Hand, Mannila, Smyth]
3
4.1 Introduction
4.2 Probability and its interpretation
4.3 (multiple) random variables
4.4 samples, models, and patterns
4
4.2 Probability and its interpretation
Probability,
Possibility,
Chance,
Randomness,
luck,
hazard,
fate,
…
5
4.2 Probability and its interpretation
Probability Theory: interpretation
Probability Calculus: mathematical manipulation and
computing
6
4.2 Probability and its interpretation
Frequentist view: probability is objective:
Prob = (# times the event occurs) / (total # of trials)
EXAMPLE: traditional probability theory
Subjective view: probability is a (subjective) degree of belief
EXAMPLE: Bayesian statistics
7
4.3 Random Variables and their
Relationships
Random Variables [4.3]
random variable X:
multivariate random variables
marginal density
8
4.3 Random Variables and their
Relationships
conditional density & dependency:
* example: supermarket purchases
9
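These densities are only named on the slides above; as a reminder, the standard definitions (in the same notation) for a pair of random variables X and Y with joint density p(x, y) are:

p(x) = Σ_y p(x, y)   (marginal density; an integral ∫ p(x, y) dy in the continuous case)

p(x | y) = p(x, y) / p(y),  for p(y) > 0   (conditional density)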
4.3 Random Variables
10
4.3 Random Variables
11
RANDOM VARIABLES
Example: supermarket purchases
X = an n customers × p products matrix;
X(i,j) = Boolean variable: “Has customer #i bought a product of
type j ?”
nA = sum(X(:,A)) is number of customers that bought product A
nB = sum(X(:,B)) is number of customers that bought product B
nAB = sum(X(:,A).*X(:,B)) is number of customers that bought both
products A and B
12
RANDOM VARIABLES
Example: supermarket purchases
customer#   A   B
    1       1   0
    2       0   1
    3       1   1
    4       1   1

pA  = 3/4
pB  = 3/4
pAB = 2/4
pA · pB ≠ pAB
13
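A small MATLAB sketch of this computation (the 4 × 2 matrix is the table above; the variable names follow the earlier slide):

% columns: product A, product B; rows: customers 1..4 (table above)
X = [1 0; 0 1; 1 1; 1 1];
A = 1; B = 2;                    % column indices of the two products
n   = size(X,1);                 % number of customers
nA  = sum(X(:,A));               % customers that bought A
nB  = sum(X(:,B));               % customers that bought B
nAB = sum(X(:,A).*X(:,B));       % customers that bought both A and B
pA = nA/n; pB = nB/n; pAB = nAB/n;
fprintf('pA = %.2f, pB = %.2f, pAB = %.2f, pA*pB = %.4f\n', pA, pB, pAB, pA*pB);
% pA*pB = 0.5625 differs from pAB = 0.50, so buying A and buying B are not independent here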
RANDOM VARIABLES
independent:
p(x,y) = p(x)*p(y),  i.e.  p(x|y) = p(x)
conditionally independent given z:
p(x,y|z) = p(x|z)*p(y|z)
14
Markov Models
15
RANDOM VARIABLES
Simpson’s paradox
Observation of two different treatments for several categories of patients (success rates):

Category   Treatment A   Treatment B
Old            0.20          0.33
Young          0.53          1.00
Total          0.50          0.40
16
RANDOM VARIABLES
Explanation: the sets are POOLED:
Category   Treatment A        Treatment B
Old        2 / 10   = 0.20    30 / 90  = 0.33
Young      48 / 90  = 0.53    10 / 10  = 1.00
Total      50 / 100 = 0.50    40 / 100 = 0.40
Both for Old and Young individually, treatment B seems best, but for the total population treatment A seems superior.
17
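The reversal can be checked directly; a minimal MATLAB sketch using the counts from the table above:

% rows: Old, Young; columns: treatment A, treatment B (counts from the slide above)
success = [ 2 30;      % Old:   2 successes under A,  30 under B
           48 10];     % Young: 48 successes under A, 10 under B
treated = [10 90;      % Old:   10 patients under A,  90 under B
           90 10];     % Young: 90 patients under A,  10 under B
rate_per_category = success ./ treated            % B is better in both rows
rate_pooled       = sum(success) ./ sum(treated)  % but A is better in the pooled totals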
RANDOM VARIABLES
- 1st order Markov processes [example: text analysis and reconstruction with Markov chains]
Demo: dir *.txt
type 'werther.txt'
[m0,m1,m2] = Markov2Analysis('werther.txt');
CreateMarkovText(m0,m1,m2,3,80,20);
Dutch, German, French, English, MATLAB, European
Correlation, dependency, causation
Example 4.3
18
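Markov2Analysis and CreateMarkovText are the lecture's own demo functions and are not reproduced here; the sketch below only illustrates the same idea, a first-order character-level Markov chain estimated from a text file ('werther.txt' as in the demo) and then used to generate new text:

% estimate a 1st-order character-level Markov chain and generate text from it
txt = lower(fileread('werther.txt'));       % read the corpus
txt = regexprep(txt, '\s+', ' ');           % collapse whitespace
symbols = unique(txt);                      % alphabet actually used
[~, idx] = ismember(txt, symbols);          % map characters to symbol indices
K = numel(symbols);
C = zeros(K, K);                            % transition counts
for t = 1:numel(idx)-1
    C(idx(t), idx(t+1)) = C(idx(t), idx(t+1)) + 1;
end
P = (C + 1) ./ sum(C + 1, 2);               % smoothed transition probabilities

s = randi(K);                               % random start symbol
out = symbols(s);
for t = 1:200                               % generate 200 characters
    s = find(rand < cumsum(P(s, :)), 1);    % sample the next symbol
    if isempty(s), s = K; end               % guard against rounding at the boundary
    out(end+1) = symbols(s);
end
disp(out)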
SAMPLING
Sampling [4.4]
- Large quantities of observations: central limit theorem: approximately normally distributed
- Small sample size: modeling: OK, pattern recognition: not
- Figure 4.1
- Role of model parameters
- Probability of observing data D = {x(1), x(2), …, x(n)} with independent observations:

  P(D | M, θ) = ∏_{i=1}^{n} p(x(i) | M, θ)

  with fixed parameter set θ.
19
SAMPLING
20
ESTIMATION
Estimation [4.5]
- Properties:
  1. θ̂ is an estimator of θ; it depends on the sample, so it is itself a stochastic variable with expectation E[θ̂], etc.
  2. Bias: bias(θ̂) = E[θ̂] − θ
     unbiased: bias(θ̂) = 0, i.e. no systematic bias
  3. Consistent estimator: asymptotically unbiased
  4. Variance: var(θ̂) = E[(θ̂ − E[θ̂])²]
  5. Best unbiased estimator: the unbiased estimator with the smallest variance
21
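A small MATLAB illustration (not from the book) of bias and consistency: the variance estimator that divides by n is biased but consistent, the one that divides by n-1 is unbiased; the simulation approximates E[θ̂] for both at a small and a large sample size.

true_var = 4;                          % variance of the simulated population
for n = [5 500]                        % small and large sample size
    est_ml = zeros(1, 20000);          % divide-by-n estimator
    est_ub = zeros(1, 20000);          % divide-by-(n-1) estimator
    for r = 1:20000
        x = 2*randn(1, n);             % sample from N(0, 4)
        est_ml(r) = mean((x - mean(x)).^2);
        est_ub(r) = sum((x - mean(x)).^2) / (n - 1);
    end
    fprintf('n = %3d: E[divide-by-n] ~ %.3f (bias %.3f), E[divide-by-(n-1)] ~ %.3f\n', ...
            n, mean(est_ml), mean(est_ml) - true_var, mean(est_ub));
end
% for n = 5 the first estimator underestimates the variance (bias of order 1/n);
% for n = 500 the bias has essentially vanished: the estimator is consistent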
Maximum Likelihood Estimation
- Expression 4.7:

  L(θ | D) = ∏_{i=1}^{n} f(x(i) | θ)

  a scalar function in θ
  drop all terms not containing θ
  the value of θ with the highest L(θ) is the maximum likelihood estimator (MLE):

  θ̂_MLE = arg max_θ L(θ)

- Example 4.4 + figure 4.2
Demo: MLbinomial
22
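The MLbinomial demo itself is not reproduced here; the sketch below shows the same kind of computation in MATLAB: the log-likelihood of a binomial parameter θ for r successes in n trials (terms without θ dropped), maximized over a grid and compared with the analytical MLE r/n. The numbers are made up.

n = 20; r = 14;                               % 14 successes in 20 trials
theta = linspace(0.001, 0.999, 999);          % grid of candidate parameter values
logL = r*log(theta) + (n - r)*log(1 - theta); % log-likelihood, terms without theta dropped
[~, imax] = max(logL);
fprintf('grid MLE = %.3f, analytical MLE r/n = %.3f\n', theta(imax), r/n);
plot(theta, exp(logL - max(logL)));           % rescaled likelihood function, cf. figure 4.2
xlabel('\theta'); ylabel('L(\theta), rescaled');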
Maximum Likelihood Estimation
- Example 4.5: Normal distribution, log L(θ)
- Example 4.6: Sufficient statistic: a nice idea, but practically infeasible
- MLE: biased but consistent, bias O(1/n)
- In practice the log-likelihood l(θ) = log L(θ) is more attractive
- Multiple parameters: complex but relevant: EM-algorithm [treated later]
- Determination of confidence levels: Example 4.8
23
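For reference, the standard results behind Example 4.5 and the "biased but consistent" remark, for an i.i.d. sample x(1), …, x(n) from N(μ, σ²):

l(μ, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} (x(i) − μ)²

μ̂_MLE = (1/n) Σ_i x(i) = x̄        σ̂²_MLE = (1/n) Σ_i (x(i) − x̄)²

E[σ̂²_MLE] = ((n−1)/n) σ², so the bias is −σ²/n, of order O(1/n), and vanishes as n → ∞.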
Maximum Likelihood Estimation
- Sampling using the central limit theorem (CLT) with large samples:
- Bootstrap method (Example 4.9):
  i.   use the observed data as the real distribution F,
  ii.  generate many sub-samples from it (resampling with replacement),
  iii. estimate the properties of interest in these sub-samples using the “known” distribution F,
  iv.  compare the sub-sample estimates with the “real” estimate and apply the CLT.
24
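A minimal MATLAB sketch of the bootstrap idea in Example 4.9 (the data are made up): estimate the standard error of the sample mean by resampling the observed data with replacement.

x = [3.1 4.7 2.9 5.2 4.1 3.8 6.0 4.4 3.5 5.1];  % observed data
n = numel(x);
B = 2000;                                       % number of bootstrap sub-samples
boot_means = zeros(1, B);
for b = 1:B
    resample = x(randi(n, 1, n));               % sample n points with replacement
    boot_means(b) = mean(resample);             % estimate of interest in this sub-sample
end
fprintf('sample mean                      = %.3f\n', mean(x));
fprintf('bootstrap std. error of the mean = %.3f\n', std(boot_means));
fprintf('CLT-based std. error of the mean = %.3f\n', std(x)/sqrt(n));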
BAYESIAN ESTIMATION
- Bayesian estimation
- Here not the frequentist but the subjective approach: D is given and the parameters θ are random variables.
- p(θ) represents our degree of belief in the value of θ;
  in practice this means that we have to make all sorts of (wild) assumptions about p.
- prior distribution: p(θ)
- posterior distribution: p(θ | D)
- Modification of the prior into the posterior by Bayes' theorem:

  p(θ | D) = p(D | θ) p(θ) / p(D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ
25
BAYESIAN ESTIMATION
- MAP: Maximum A Posteriori method: take ⟨θ⟩, the mean or mode of the posterior distribution, as the estimate
- MAP relates to MLE (see the note below)
- Example 4.10
- Bayesian approach: rather than point estimates, keep the full knowledge of the uncertainties in the model(s)
- This requires massive computation -> therefore it has only become popular in the recent decade
26
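The relation between MAP and MLE mentioned above, in formulas (standard result):

θ̂_MAP = arg max_θ p(θ | D) = arg max_θ p(D | θ) p(θ)

so for a flat (uniform) prior p(θ) ∝ constant the MAP estimate coincides with the MLE, while an informative prior pulls the estimate away from the MLE towards the prior.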
BAYESIAN ESTIMATION
- Sequential updating:

  p(θ | D) ∝ p(D | θ) p(θ)                          (equation 4.10)
  p(θ | D1, D2) ∝ p(D2 | θ) p(D1 | θ) p(θ)          (equation 4.15)

- Reminiscent of a Markov process
Algorithm:
* start with p(θ)
* new data D: use eq. 4.10 to obtain posterior distribution
* new data D2 : use eq. 4.15 to obtain new posterior distribution
* repeat ad nauseam
27
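A minimal MATLAB sketch of this sequential updating for a Bernoulli parameter θ with a uniform prior, evaluated on a grid (the two data batches are made up); after each batch the posterior of eq. 4.10 / 4.15 becomes the prior for the next batch:

theta = linspace(0, 1, 201);            % grid for the parameter
prior = ones(size(theta));              % start with a flat prior p(theta)
prior = prior / trapz(theta, prior);    % normalize

batches = {[1 0 1 1], [1 1 0 1 1 1]};   % two batches of coin flips
for d = 1:numel(batches)
    D = batches{d};
    lik = theta.^sum(D) .* (1 - theta).^(numel(D) - sum(D));   % p(D | theta)
    posterior = lik .* prior;                                   % eq. 4.10, unnormalized
    posterior = posterior / trapz(theta, posterior);            % normalize
    [~, imode] = max(posterior);
    fprintf('after batch %d: posterior mean %.3f, mode %.3f\n', ...
            d, trapz(theta, theta .* posterior), theta(imode));
    prior = posterior;                  % the posterior becomes the new prior (eq. 4.15)
end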
BAYESIAN ESTIMATION
-
Large Problem: different observers may define different p(θ); the choice is subjective!
hence: different results! Analysis depends on choice of p(θ)
-
Often a consensus prior e.g. Jeffrey’s prior (equations 4.16 and 4.17)
-
Computation of credibility interval
-
Markov Chain Monte Carlo (MCMC) methods for estimating posterior distributions
-
Similarly: Bayesian Belief Networks (BBN)
28
BAYESIAN ESTIMATION
- Bayesian approach: Uncertainty_Model + Uncertainty_Params
- MLE: the aim is a point estimate
- Bayesian: the aim is the posterior distribution
- Bayesian estimate = weighted estimate over all models M and all parameters θ,
  where the weights are the likelihoods of the parameters in the different models (see the formula below)
- PROBLEM: these weights are difficult to estimate
29
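One standard way to write such a weighted estimate for a quantity of interest x is the generic Bayesian model averaging formula (not necessarily the book's exact equation):

p(x | D) = Σ_M ∫ p(x | θ, M) p(θ | D, M) p(M | D) dθ

The weights p(θ | D, M) and p(M | D) are exactly the quantities that are hard to compute, which is why approximations such as MCMC are used.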
30
PROBABILISTIC MODEL-BASED
CLUSTERING USING MIXTURE
MODELS
Data Mining Lecture VI
[4.5, 8.4, 9.2, 9.6, Hand, Mannila, Smyth]
31
Probabilistic Model-Based
Clustering using Mixture Models
A probability mixture model
A mixture model is a formalism for modeling a
probability density function as a sum of
parameterized functions. In mathematical
terms:
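pX(x) = Σ_{k=1}^{K} ak · h(x | λk)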
32
A probability mixture model
where pX(x) is the modeled probability
distribution function, K is the number of
components in the mixture model, and ak is
the mixture proportion of component k. By
definition, 0 < ak < 1 for all k = 1…K and:
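Σ_{k=1}^{K} ak = 1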
33
A probability mixture model
h(x | λk) is a probability distribution parameterized by λk.
Mixture models are often used when we know h(x) and
we can sample from pX(x), but we would like to
determine the ak and λk values.
Such situations can arise in studies in which we sample
from a population that is composed of several distinct
subpopulations.
34
A common approach for ‘decomposing’ a
mixture model
It is common to think of mixture modeling as a
missing data problem. One way to understand this is
to assume that the data points under consideration
have "membership" in one of the distributions we
are using to model the data. When we start, this
membership is unknown, or missing. The job of
estimation is to devise appropriate parameters for
the model functions we choose, with the connection
to the data points being represented as their
membership in the individual model distributions.
35
Probabilistic Model-Based
Clustering using Mixture Models
The EM-algorithm [book section 8.4]
36
Mixture Decomposition:
The ‘Expectation-Maximization’ Algorithm
The Expectation-maximization algorithm
computes the missing memberships of data points
in our chosen distribution model.
It is an iterative procedure, where we start with initial
parameters for our model distribution (the ak's and
λk's of the model listed above).
The estimation process proceeds iteratively in two
steps, the Expectation Step, and the Maximization
Step.
37
The ‘Expectation-Maximization’ Algorithm
The expectation step
With initial guesses for the parameters in our
mixture model, we compute "partial membership"
of each data point in each constituent distribution.
This is done by calculating expectation values for
the membership variables of each data point.
38
The ‘Expectation-Maximization’ Algorithm
The maximization step
With the expectation values in hand for group
membership, we can recompute plug-in estimates
of our distribution parameters.
For the mixing coefficients this is simply the
fractional membership of all data points in the
corresponding component distribution (see the formulas below).
39
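In formulas (the standard EM updates for a mixture model with n data points x_1, …, x_n; w_ik denotes the partial membership of point x_i in component k, a name used here for convenience):

E-step:   w_ik = ak h(x_i | λk) / Σ_{j=1}^{K} aj h(x_i | λj)

M-step:   ak = (1/n) Σ_{i=1}^{n} w_ik,   and the λk are re-estimated from the data weighted by w_ik.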
EM-algorithm for Clustering
Suppose we have data D, a model with parameters θ, and hidden parameters H.
Interpretation: H = the class label.
Log-likelihood of the observed data:

  l(θ) = log p(D | θ) = log Σ_H p(D, H | θ)
40
EM-algorithm for Clustering
With p the probability over the data D.
Let Q be the unknown distribution over the
hidden parameters H
Then the log-likelihood is:
41
Then the log-likelihood is:

  l(θ) = log Σ_H p(D, H | θ)

  l(θ) = log Σ_H Q(H) · p(D, H | θ) / Q(H)

  l(θ) ≥ Σ_H Q(H) · log( p(D, H | θ) / Q(H) )        [* Jensen's inequality]

  l(θ) ≥ Σ_H Q(H) · log p(D, H | θ) + Σ_H Q(H) · log( 1 / Q(H) )  =  F(Q, θ)
42
Jensen’s inequality
For a concave (concave-down) function, the expected value of the function is at most the function of the expected value. The gray rectangle along the horizontal axis represents the probability distribution of x, which is uniform for simplicity, but the general idea applies to any distribution.
43
EM-algorithm
So: F(Q, θ) is a lower bound on the log-likelihood
function l(θ).
EM alternates between:
E-step: maximising F with respect to Q, with θ fixed, and:
M-step: maximising F with respect to θ, with Q fixed.
44
EM-algorithm
E-step:   Q^(k+1) = arg max_Q F(Q, θ^(k))

M-step:   θ^(k+1) = arg max_θ F(Q^(k+1), θ)
45
Probabilistic Model-Based Clustering using Gaussian Mixtures
[Slides 46-49: figures illustrating Gaussian mixture clustering]
Probabilistic Model-Based
Clustering using Mixture Models
Gaussian Mixture Decomposition
Gaussian mixture decomposition is a good classifier. It allows supervised as well as unsupervised learning (finding how many classes are optimal, how they should be defined, ...). But training is iterative and time-consuming.
The idea is to set the positions and widths of the Gaussian distribution(s) so as to optimize the coverage of the learning samples.
50
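A compact MATLAB sketch of this procedure for a one-dimensional mixture of two Gaussians (illustrative only; it fits the positions, widths, and mixing proportions by iterating the E- and M-steps above on simulated data):

% simulated data from two Gaussian clusters
x = [1.0 + 0.5*randn(1, 300), 4.0 + 0.8*randn(1, 200)];
n = numel(x);  K = 2;

a  = [0.5 0.5];            % initial mixing proportions
mu = [min(x) max(x)];      % initial positions (means)
sd = [1 1];                % initial widths (standard deviations)
gauss = @(x, m, s) exp(-(x - m).^2 ./ (2*s.^2)) ./ (sqrt(2*pi)*s);

for iter = 1:100
    % E-step: partial memberships of each point in each component
    w = zeros(K, n);
    for k = 1:K
        w(k, :) = a(k) * gauss(x, mu(k), sd(k));
    end
    w = w ./ sum(w, 1);
    % M-step: re-estimate proportions, positions, and widths from the weighted data
    for k = 1:K
        nk    = sum(w(k, :));
        a(k)  = nk / n;
        mu(k) = sum(w(k, :) .* x) / nk;
        sd(k) = sqrt(sum(w(k, :) .* (x - mu(k)).^2) / nk);
    end
end
fprintf('a = [%.2f %.2f], mu = [%.2f %.2f], sd = [%.2f %.2f]\n', a, mu, sd);
[~, labels] = max(w, [], 1);   % unsupervised clustering: most probable component per point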
Probabilistic Model-Based Clustering using Mixture Models
[Slides 51-60: figures]
61
The End
62