Data Mining - Mount Holyoke College


Data Mining
CS 341, Spring 2007
Lecture 4: Data Mining Techniques (I)
Review:
• Information Retrieval
  – Similarity measures
  – Evaluation metrics: Precision and Recall
• Question Answering
• Web Search Engine
  – An application of IR
  – Related to web mining
Data Mining Techniques Outline
Goal: Provide an overview of basic data mining techniques
• Statistical
  – Point Estimation
  – Models Based on Summarization
  – Bayes Theorem
  – Hypothesis Testing
  – Regression and Correlation
• Similarity Measures
Point Estimation
• Point Estimate: estimate a population parameter.
• May be made by calculating the parameter for a sample.
• May be used to predict a value for missing data.
• Ex:
  – R contains 100 employees
  – 99 have salary information
  – Mean salary of these is $50,000
  – Use $50,000 as the value of the remaining employee's salary.
  – Is this a good idea?
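A minimal sketch of this kind of mean imputation in Python; the individual salary figures are illustrative, only their $50,000 mean mirrors the slide:

```python
# Point estimate used to fill in missing data: the mean salary of the
# employees with known salaries stands in for the one unknown salary.
known_salaries = [48_000, 50_000, 52_000]   # illustrative sample, mean $50,000
point_estimate = sum(known_salaries) / len(known_salaries)
print(f"Imputed salary for the remaining employee: ${point_estimate:,.0f}")
```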
Estimation Error
• Bias: difference between the expected value and the actual value.
• Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value.
• Root Mean Square Error (RMSE): square root of the MSE.
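In symbols (assuming θ is the true parameter value and θ̂ the estimate), the standard definitions are:

$$ \mathrm{Bias}(\hat\theta) = E[\hat\theta] - \theta, \qquad \mathrm{MSE}(\hat\theta) = E\big[(\hat\theta - \theta)^2\big], \qquad \mathrm{RMSE}(\hat\theta) = \sqrt{\mathrm{MSE}(\hat\theta)} $$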
Jackknife Estimate
• Jackknife Estimate: an estimate of a parameter obtained by omitting one value (or group of values) at a time from the set of observed values.
• Named to describe a "handy and useful tool."
• Used to reduce bias.
• Property: the Jackknife estimator lowers the bias from the order of 1/n to 1/n².
Jackknife Estimate
• Definition:
  – Divide the sample of size n into g groups of size m each, so n = mg (often m = 1 and g = n).
  – θ̂(j) is the estimate obtained by ignoring the jth group.
  – θ̂(·) is the average of the θ̂(j).
  – The Jackknife estimator is
    » θ̂Q = g·θ̂ − (g − 1)·θ̂(·),
      where θ̂ is an estimator for the parameter θ.
Jackknife Estimator: Example 1
• Estimate of the mean for X = {x1, x2, x3}, n = 3, g = 3, m = 1; θ̂ = x̄ = (x1 + x2 + x3)/3
• θ̂(1) = (x2 + x3)/2, θ̂(2) = (x1 + x3)/2, θ̂(3) = (x1 + x2)/2
• θ̂(·) = (θ̂(1) + θ̂(2) + θ̂(3))/3
• θ̂Q = g·θ̂ − (g − 1)·θ̂(·) = 3θ̂ − (3 − 1)·θ̂(·) = (x1 + x2 + x3)/3
• In this case, the Jackknife Estimator is the same as the usual estimator.
Jackknife Estimator: Example 2
• Estimate of the variance for X = {1, 4, 4}, n = 3, g = 3, m = 1; θ = σ²
• σ̂² = ((1 − 3)² + (4 − 3)² + (4 − 3)²)/3 = 2
• θ̂(1) = ((4 − 4)² + (4 − 4)²)/2 = 0, θ̂(2) = 2.25, θ̂(3) = 2.25
• θ̂(·) = (θ̂(1) + θ̂(2) + θ̂(3))/3 = 1.5
• θ̂Q = g·θ̂ − (g − 1)·θ̂(·) = 3(2) − 2(1.5) = 3
• In this case, the Jackknife Estimator is different from the usual estimator.
Jackknife Estimator: Example 2 (cont'd)
• In general, applying the Jackknife technique to the biased estimator of σ²,
  σ̂² = Σ (xi − x̄)² / n,
  gives the jackknife estimator
  s² = Σ (xi − x̄)² / (n − 1),
  which is known to be unbiased for σ².
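A minimal Python sketch of the leave-one-out jackknife (m = 1, g = n) that reproduces Example 2; the function names are mine, not from the slides:

```python
def biased_variance(xs):
    """Biased (maximum-likelihood) variance estimate: sum((x - mean)^2) / n."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def jackknife_estimate(xs, estimator):
    """Jackknife with g = n groups of size m = 1:
    theta_Q = g * theta_hat - (g - 1) * theta_bar."""
    g = len(xs)
    theta_hat = estimator(xs)
    leave_one_out = [estimator(xs[:j] + xs[j + 1:]) for j in range(g)]
    theta_bar = sum(leave_one_out) / g
    return g * theta_hat - (g - 1) * theta_bar


X = [1, 4, 4]
print(biased_variance(X))                       # 2.0 (usual biased estimate)
print(jackknife_estimate(X, biased_variance))   # 3.0 (equals s^2 with n - 1)
```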
Maximum Likelihood Estimate (MLE)
• Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.
• The joint probability of observing the sample data is obtained by multiplying the individual probabilities.
• Likelihood function L: the joint probability viewed as a function of the parameters (see below).
• Maximize L.
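In standard notation, assuming independent observations x1, …, xn drawn from a model with parameter(s) Θ and density or mass function f(x | Θ), the likelihood function is:

$$ L(\Theta \mid x_1, \dots, x_n) = \prod_{i=1}^{n} f(x_i \mid \Theta) $$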
MLE Example
• Coin tossed five times: {H, H, H, H, T}
• Assuming a perfect coin with H and T equally likely, the likelihood of this sequence can be computed (see below).
• However, if the probability of an H is 0.8, the likelihood changes (see below).
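The two values referenced above follow directly from the stated probabilities:

$$ L(0.5) = (0.5)^5 = 0.03125 \qquad \text{vs.} \qquad L(0.8) = (0.8)^4 (0.2) = 0.08192 $$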
MLE Example (cont'd)
• General likelihood formula: see below.
• Estimate for p is then 4/5 = 0.8.
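In standard form for n independent tosses (coding xi = 1 for H and 0 for T; this general formula is assumed, not quoted from the slide), the likelihood and its maximizer are:

$$ L(p) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i} = p^{\sum x_i} (1-p)^{\,n - \sum x_i}, \qquad \hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{4}{5} $$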
Expectation-Maximization (EM)
• Solves estimation with incomplete data.
• Obtain initial estimates for parameters.
• Iteratively use estimates for missing data and continue until convergence.
EM Example
EM Algorithm
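A minimal Python sketch of the EM idea for the simplest incomplete-data setting described above, estimating a mean when some values are missing; the data, starting guess, and function name em_mean are illustrative assumptions, not taken from the slides:

```python
def em_mean(observed, n_missing, mu0=0.0, tol=1e-6, max_iter=100):
    """EM-style estimate of a mean when n_missing values are unobserved.

    E-step: fill each missing value with the current estimate mu.
    M-step: recompute the mean over observed + filled-in values.
    Repeat until the estimate stops changing."""
    n = len(observed) + n_missing
    mu = mu0                                              # initial guess
    for _ in range(max_iter):
        expected_total = sum(observed) + n_missing * mu   # E-step
        new_mu = expected_total / n                       # M-step
        if abs(new_mu - mu) < tol:
            break
        mu = new_mu
    return mu


# Illustrative data: three observed values, two missing.
print(em_mean([1.0, 5.0, 9.0], n_missing=2))   # converges to ~5.0, the observed mean
```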
Models Based on Summarization
• Basic concepts that provide an abstraction and summarization of the data as a whole.
  – Statistical concepts: mean, variance, median, mode, etc.
• Visualization: display the structure of the data graphically.
  – Line graphs, pie charts, histograms, scatter plots, hierarchical graphs
Scatter Diagram
Bayes Theorem
• Posterior probability: P(h1 | xi)
• Prior probability: P(h1)
• Bayes Theorem: assigns probabilities to hypotheses given a data value (formula below).
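Written out for hypothesis hj and data value xi (this matches the form used later in the example, P(h1|x4) = P(x4|h1)P(h1)/P(x4)):

$$ P(h_j \mid x_i) = \frac{P(x_i \mid h_j)\, P(h_j)}{P(x_i)} $$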
Bayes Theorem Example
• Credit authorizations (hypotheses): h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize but contact police.
• Assign twelve data values for all combinations of credit and income:

  Credit \ Income    1     2     3     4
  Excellent          x1    x2    x3    x4
  Good               x5    x6    x7    x8
  Bad                x9    x10   x11   x12

• From training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.
Bayes Example (cont'd)
• Training Data:

  ID   Income   Credit      Class   xi
  1    4        Excellent   h1      x4
  2    3        Good        h1      x7
  3    2        Excellent   h1      x2
  4    3        Good        h1      x7
  5    4        Good        h1      x8
  6    2        Excellent   h1      x2
  7    3        Bad         h2      x11
  8    2        Bad         h2      x10
  9    3        Bad         h3      x11
  10   1        Bad         h4      x9
Bayes Example (cont'd)
• Calculate P(xi | hj) and P(xi).
• Ex: P(x7|h1) = 2/6; P(x4|h1) = 1/6; P(x2|h1) = 2/6; P(x8|h1) = 1/6; P(xi|h1) = 0 for all other xi.
• Predict the class for x4:
  – Calculate P(hj|x4) for all hj.
  – Place x4 in the class with the largest value.
  – Ex:
    » P(h1|x4) = P(x4|h1)P(h1)/P(x4) = (1/6)(0.6)/0.1 = 1
    » x4 goes in class h1.
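A small Python sketch of this calculation, re-encoding the training table and priors from the earlier slides (the helper name posterior is mine):

```python
from collections import Counter

# (class, x) pairs taken from the training-data table above.
training = [("h1", "x4"), ("h1", "x7"), ("h1", "x2"), ("h1", "x7"), ("h1", "x8"),
            ("h1", "x2"), ("h2", "x11"), ("h2", "x10"), ("h3", "x11"), ("h4", "x9")]
prior = {"h1": 0.6, "h2": 0.2, "h3": 0.1, "h4": 0.1}   # P(hj) from the earlier slide

n = len(training)
x_counts = Counter(x for _, x in training)       # for P(xi)
class_counts = Counter(h for h, _ in training)   # denominator of P(xi | hj)
joint_counts = Counter(training)                 # counts of (hj, xi) pairs


def posterior(h, x):
    """P(h | x) = P(x | h) * P(h) / P(x), with probabilities read off the counts."""
    p_x_given_h = joint_counts[(h, x)] / class_counts[h]
    p_x = x_counts[x] / n
    return p_x_given_h * prior[h] / p_x


# Predict the class for x4: h1 scores 1.0, every other hypothesis scores 0.
print({h: posterior(h, "x4") for h in prior})
```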
Hypothesis Testing
• Find model to explain behavior by creating and then testing a hypothesis about the data.
• Exact opposite of usual DM approach.
• H0 – null hypothesis; hypothesis to be tested.
• H1 – alternative hypothesis.
Chi-Square Test
• One technique to perform hypothesis testing.
• Used to test the association between two observed variable values and to determine whether a set of observed values is statistically different from the expected values.
• The chi-squared statistic is defined as shown below, where:
  – O – observed value
  – E – expected value based on the hypothesis
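In standard form, using the O and E defined above:

$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$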
Chi-Square Test
• Given the average scores of five schools, determine whether the difference is statistically significant.
• Ex:
  – O = {50, 93, 67, 78, 87}
  – E = 75
  – χ² = 15.55, and therefore the difference is significant.
• Examine a chi-squared significance table:
  – With 4 degrees of freedom and a significance level of 95%, the critical value is 9.488. Thus the variance between the schools' scores and the expected value cannot be associated with pure chance.
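A quick Python check of the arithmetic above (the 9.488 critical value is quoted from the slide, not computed here):

```python
observed = [50, 93, 67, 78, 87]
expected = 75

# Chi-squared statistic: sum of (O - E)^2 / E over the observed values.
chi2 = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi2, 2))     # 15.55

# Critical value for 4 degrees of freedom at the 95% level, as quoted on the slide.
critical = 9.488
print(chi2 > critical)    # True -> the difference is statistically significant
```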
Regression
• Predict future values based on past values.
• Fit a set of points to a curve.
• Linear regression assumes a linear relationship exists:
  y = c0 + c1x1 + … + cnxn
  – n input variables (called regressors or predictors)
  – one output variable (called the response)
  – n + 1 constants, chosen during the modeling process to match the input examples
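A minimal sketch of the one-input case using NumPy's least-squares polynomial fit; the data points are illustrative, not from the lecture:

```python
import numpy as np

# Illustrative data with a roughly linear trend (y ≈ 1 + 2x plus a little noise).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Least-squares fit of y = c0 + c1 * x (a degree-1 polynomial).
c1, c0 = np.polyfit(x, y, 1)
print(f"y = {c0:.2f} + {c1:.2f} x")

# Use the fitted line to predict a future value.
print(c0 + c1 * 6.0)
```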
Linear Regression -- with one input value
Correlation
• Examine the degree to which the values for two variables behave similarly.
• Correlation coefficient r (formula below):
  – 1 = perfect correlation
  – -1 = perfect but opposite correlation
  – 0 = no correlation
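A standard way to write r, consistent with the next slide's note that X̄ and Ȳ are the means of X and Y:

$$ r = \frac{\sum_{i}(x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_{i}(x_i - \bar{X})^2 \, \sum_{i}(y_i - \bar{Y})^2}} $$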
Correlation
• where X̄ and Ȳ are the means for X and Y respectively.
• Suppose X = (1, 3, 5, 7, 9) and Y = (9, 7, 5, 3, 1). r = ?
• Suppose X = (1, 3, 5, 7, 9) and Y = (2, 4, 6, 8, 10). r = ?
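One way to answer the two questions above, using NumPy's correlation-coefficient routine:

```python
import numpy as np

X = np.array([1, 3, 5, 7, 9])

# Y decreases exactly as X increases: perfect but opposite correlation.
Y1 = np.array([9, 7, 5, 3, 1])
print(np.corrcoef(X, Y1)[0, 1])   # -1.0

# Y increases exactly with X: perfect positive correlation.
Y2 = np.array([2, 4, 6, 8, 10])
print(np.corrcoef(X, Y2)[0, 1])   # 1.0
```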
Similarity Measures
• Determine the similarity between two objects.
• Similarity characteristics:
• Alternatively, distance measures measure how unlike or dissimilar objects are.
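As an illustrative sketch only: two commonly used similarity measures, cosine and Jaccard similarity, in Python (these particular measures are my choice of examples and are not necessarily the ones on the following slide):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two numeric vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def jaccard_similarity(s, t):
    """Set overlap: |intersection| / |union|."""
    return len(s & t) / len(s | t)


print(cosine_similarity([1, 0, 2], [2, 0, 4]))                   # 1.0
print(jaccard_similarity({"data", "mining"}, {"data", "web"}))   # 0.333...
```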
Similarity Measures
Distance Measures
• Measure dissimilarity between objects.
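A brief sketch of two common distance measures, Euclidean and Manhattan distance (again my choice of examples, not necessarily the slide's):

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


print(euclidean_distance([0, 0], [3, 4]))   # 5.0
print(manhattan_distance([0, 0], [3, 4]))   # 7
```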
Next Lecture:
• Data Mining Techniques (II)
  – Decision trees, neural networks and genetic algorithms
• Reading assignment: Chapter 3