Data Mining - Mount Holyoke College
Data Mining
CS 341, Spring 2007
Lecture 4: Data Mining Techniques (I)
Review:
Information Retrieval
– Similarity measures
– Evaluation Metrics: Precision and Recall
Question Answering
Web Search Engine
– An application of IR
– Related to web mining
Data Mining Techniques Outline
Goal: Provide an overview of basic data
mining techniques
Statistical
– Point Estimation
– Models Based on Summarization
– Bayes Theorem
– Hypothesis Testing
– Regression and Correlation
Similarity Measures
Point Estimation
Point Estimate: estimate a population
parameter.
May be made by calculating the parameter for a
sample.
May be used to predict value for missing data.
Ex:
– R contains 100 employees
– 99 have salary information
– Mean salary of these is $50,000
– Use $50,000 as value of remaining employee's salary.
Is this a good idea?
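As an illustration, here is a minimal Python sketch of this mean imputation, with made-up stand-in values for the known salaries:

    # Point estimation: use the sample mean of the observed salaries
    # as the estimate for the one missing salary (toy stand-in data).
    known_salaries = [49_000.0, 51_000.0, 50_000.0]  # stand-ins for the 99 known values

    mean_salary = sum(known_salaries) / len(known_salaries)
    print(f"Point estimate for the missing salary: ${mean_salary:,.0f}")
    # -> Point estimate for the missing salary: $50,000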
Estimation Error
Bias: difference between the expected value of the estimator and the actual parameter value: Bias = E[θ̂] – θ.
Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: MSE(θ̂) = E[(θ̂ – θ)²].
Root Mean Squared Error (RMSE): the square root of the MSE, RMSE(θ̂) = √MSE(θ̂).
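A small Python sketch of these error measures on illustrative (made-up) estimate/actual pairs:

    import math

    # Estimation error measures for a set of (estimate, actual) pairs.
    estimates = [48.0, 52.0, 51.0, 49.0]
    actuals   = [50.0, 50.0, 50.0, 50.0]

    errors = [e - a for e, a in zip(estimates, actuals)]
    bias = sum(errors) / len(errors)                      # mean (estimate - actual)
    mse  = sum(err ** 2 for err in errors) / len(errors)  # mean squared error
    rmse = math.sqrt(mse)                                 # root mean squared error
    print(bias, mse, rmse)                                # 0.0 2.5 1.58...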
Jackknife Estimate
Jackknife Estimate: an estimate of a parameter obtained by omitting one value (or group of values) at a time from the set of observed values.
Named after the jackknife, a "handy and useful tool."
Used to reduce bias.
Property: the Jackknife estimator lowers the bias from order 1/n to order 1/n².
Jackknife Estimate
Definition:
– Divide the sample of size n into g groups of size m each, so n = mg (often m = 1 and g = n).
– θ̂(j) is the estimate obtained by ignoring the jth group.
– θ̄(·) is the average of the θ̂(j).
– The Jackknife estimator is
» θQ = gθ̂ – (g – 1)θ̄(·),
where θ̂ is the usual estimator of the parameter θ computed from all n observations.
Jackknife Estimator: Example 1
Estimate of the mean for X = {x1, x2, x3}, n = 3, g = 3, m = 1:
θ̂ = x̄ = (x1 + x2 + x3)/3
θ̂(1) = (x2 + x3)/2, θ̂(2) = (x1 + x3)/2, θ̂(3) = (x1 + x2)/2
θ̄(·) = (θ̂(1) + θ̂(2) + θ̂(3))/3 = (x1 + x2 + x3)/3
θQ = gθ̂ – (g – 1)θ̄(·) = 3θ̂ – 2θ̄(·) = (x1 + x2 + x3)/3
In this case, the Jackknife Estimator is the
same as the usual estimator.
Jackknife Estimator: Example 2
Estimate of the variance for X = {1, 4, 4}, n = 3, g = 3, m = 1; here θ = σ² and x̄ = 3:
θ̂ = σ̂² = ((1 – 3)² + (4 – 3)² + (4 – 3)²)/3 = 2
θ̂(1) = ((4 – 4)² + (4 – 4)²)/2 = 0,
θ̂(2) = 2.25, θ̂(3) = 2.25
θ̄(·) = (0 + 2.25 + 2.25)/3 = 1.5
θQ = gθ̂ – (g – 1)θ̄(·) = 3(2) – 2(1.5) = 3
In this case, the Jackknife Estimator is
different from the usual estimator.
Jackknife Estimator: Example 2 (cont'd)
In general, applying the Jackknife technique to the biased variance estimator
σ̂² = Σ(xi – x̄)² / n
yields the jackknife estimator
s² = Σ(xi – x̄)² / (n – 1),
which is known to be unbiased for σ².
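A minimal Python sketch of the m = 1 jackknife from the definition above, reproducing Example 2 on X = {1, 4, 4}:

    def biased_var(xs):
        """Biased variance estimator: sum((x - mean)^2) / n."""
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    def jackknife(estimator, xs):
        """Jackknife with m = 1, g = n: theta_Q = g*theta_hat - (g-1)*theta_bar."""
        g = len(xs)
        theta_hat = estimator(xs)
        leave_one_out = [estimator(xs[:j] + xs[j + 1:]) for j in range(g)]
        theta_bar = sum(leave_one_out) / g
        return g * theta_hat - (g - 1) * theta_bar

    X = [1, 4, 4]
    print(jackknife(biased_var, X))   # 3.0, matching Example 2 (and s^2)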
Maximum Likelihood
Estimate (MLE)
Obtain parameter estimates that maximize
the probability that the sample data occurs for
the specific model.
The joint probability of observing the sample
data is obtained by multiplying the individual probabilities.
Likelihood function:
L(Θ | x1, …, xn) = f(x1 | Θ) · f(x2 | Θ) · … · f(xn | Θ)
Maximize L.
MLE Example
Coin toss five times: {H, H, H, H, T}
Assuming a perfect coin with H and T equally
likely, the likelihood of this sequence is:
L = (0.5)⁵ = 0.03125
However, if the probability of an H is 0.8, then:
L = (0.8)⁴(0.2) = 0.08192
MLE Example (cont’d)
General likelihood formula: with p = P(H) and four heads in five tosses,
L(p) = p⁴(1 – p)
Maximizing L (set dL/dp = 4p³ – 5p⁴ = 0) gives the estimate for p: 4/5 = 0.8.
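A quick numeric check of this example in Python: evaluate L(p) at 0.5 and 0.8, then find the maximizer over a grid (a sketch, not the analytic derivation):

    # Likelihood of the sequence {H,H,H,H,T} as a function of p = P(H):
    # L(p) = p^4 * (1 - p).  Scan a grid to find the maximizer.
    def likelihood(p):
        return p ** 4 * (1 - p)

    print(likelihood(0.5))   # 0.03125 (perfect coin)
    print(likelihood(0.8))   # 0.08192

    grid = [i / 1000 for i in range(1001)]
    p_hat = max(grid, key=likelihood)
    print(p_hat)             # 0.8, the MLE 4/5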
Expectation-Maximization
(EM)
Solves estimation with incomplete data.
Obtain initial estimates for parameters.
Iteratively use estimates for missing
data and continue until convergence.
EM Example
[Slide figure not captured in the transcript.]
EM Algorithm
[Slide figure not captured in the transcript.]
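Since neither slide's content survives in the transcript, here is a hedged Python sketch of the EM idea in the spirit of the earlier salary example: iteratively fill in missing values with the current estimate (E-step), then re-estimate the parameter (M-step) until convergence. The data and formulation are illustrative assumptions, not the original slides' example:

    # EM-style sketch: estimate a population mean when some of the n
    # observations are missing (toy data; one simple formulation).
    observed = [1.0, 5.0, 10.0, 4.0]   # observed values
    n_missing = 2                      # values lost; n = 6 in total
    n = len(observed) + n_missing

    mu = 0.0                           # initial guess for the mean
    for step in range(100):
        # E-step: fill each missing value with the current estimate mu.
        filled_sum = sum(observed) + n_missing * mu
        # M-step: re-estimate mu from the completed data.
        new_mu = filled_sum / n
        if abs(new_mu - mu) < 1e-9:    # convergence check
            break
        mu = new_mu

    print(round(mu, 4))                # converges to 5.0, the observed mean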
Models Based on Summarization
Basic concepts to provide an abstraction
and summarization of the data as a
whole.
– Statistical concepts: mean, variance, median, mode,
etc.
Visualization: display the structure of the
data graphically.
– Line graphs, Pie charts, Histograms, Scatter plots,
Hierarchical graphs
Scatter Diagram
[Slide figure not captured in the transcript.]
Bayes Theorem
Posterior Probability: P(h1|xi)
Prior Probability: P(h1)
Bayes Theorem:
P(h1 | xi) = P(xi | h1) P(h1) / P(xi)
Assigns probabilities to hypotheses given a
data value.
Bayes Theorem Example
Credit authorizations (hypotheses):
h1=authorize purchase, h2 = authorize after
further identification, h3=do not authorize,
h4= do not authorize but contact police
Assign twelve data values for all
combinations of credit and income:

              Income: 1    2    3    4
    Excellent         x1   x2   x3   x4
    Good              x5   x6   x7   x8
    Bad               x9   x10  x11  x12
From training data: P(h1) = 60%; P(h2)=20%;
P(h3)=10%; P(h4)=10%.
Bayes Example (cont'd)
Training Data:

    ID   Income   Credit      Class   xi
    1    4        Excellent   h1      x4
    2    3        Good        h1      x7
    3    2        Excellent   h1      x2
    4    3        Good        h1      x7
    5    4        Good        h1      x8
    6    2        Excellent   h1      x2
    7    3        Bad         h2      x11
    8    2        Bad         h2      x10
    9    3        Bad         h3      x11
    10   1        Bad         h4      x9
Bayes Example (cont'd)
Calculate P(xi | hj) and P(xi).
Ex: P(x7 | h1) = 2/6; P(x4 | h1) = 1/6; P(x2 | h1) = 2/6;
P(x8 | h1) = 1/6; P(xi | h1) = 0 for all other xi.
Predict the class for x4:
– Calculate P(hj | x4) for all hj.
– Place x4 in the class with the largest value.
– Ex:
» P(h1 | x4) = P(x4 | h1) P(h1) / P(x4) = (1/6)(0.6)/0.1 = 1.
» x4 goes in class h1.
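A small Python sketch verifying this calculation from the counts in the training table above:

    # Bayes posterior for x4: priors from the slide, likelihoods counted
    # from the training table, P(x4) = 1/10 (x4 occurs once in 10 rows).
    priors = {"h1": 0.6, "h2": 0.2, "h3": 0.1, "h4": 0.1}
    likelihood_x4 = {"h1": 1 / 6, "h2": 0.0, "h3": 0.0, "h4": 0.0}
    p_x4 = 1 / 10

    posteriors = {h: likelihood_x4[h] * priors[h] / p_x4 for h in priors}
    best = max(posteriors, key=posteriors.get)
    print(posteriors)   # {'h1': 1.0, 'h2': 0.0, 'h3': 0.0, 'h4': 0.0}
    print(best)         # h1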
Hypothesis Testing
Find model to explain behavior by
creating and then testing a hypothesis
about the data.
Exact opposite of usual DM approach.
H0 – Null hypothesis; Hypothesis to be
tested.
H1 – Alternative hypothesis
Chi-Square Test
One technique for performing hypothesis testing.
Used to test the association between observed
variable values and to determine whether a set of
observed values is statistically different from the expected values.
The chi-squared statistic is defined as:
χ² = Σ (O – E)² / E
O – observed value
E – expected value based on the hypothesis.
Chi-Square Test
Given the average scores of five schools,
determine whether the difference is
statistically significant.
Ex:
– O = {50, 93, 67, 78, 87}
– E = 75
– χ² = 15.55, and therefore significant
Examine a chi-squared significance table:
– with 4 degrees of freedom and a significance level of 95%,
the critical value is 9.488. Thus the variation
between the schools' scores and the expected
value cannot be attributed to pure chance.
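A Python sketch reproducing this computation; the critical value is looked up with scipy.stats (assuming SciPy is available):

    from scipy.stats import chi2

    O = [50, 93, 67, 78, 87]   # observed school scores
    E = 75                     # expected score under the null hypothesis

    stat = sum((o - E) ** 2 / E for o in O)
    critical = chi2.ppf(0.95, df=len(O) - 1)   # 95% critical value, 4 degrees of freedom
    print(round(stat, 2))                      # 15.55
    print(round(critical, 3))                  # 9.488
    print(stat > critical)                     # True -> statistically significant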
Regression
Predict future values based on past values.
Fit a set of points to a curve.
Linear Regression assumes a linear
relationship exists:
y = c0 + c1x1 + … + cnxn
– n input variables (called regressors or predictors)
– one output variable (called the response)
– n + 1 constants, chosen during the modeling
process to match the input examples
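For the one-input case, the least-squares coefficients have a closed form; a minimal Python sketch on made-up points:

    # Simple (one-input) linear regression y = c0 + c1*x by least squares.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.1, 3.9, 6.2, 8.0, 9.9]   # roughly y = 2x (toy data)

    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    c1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
          / sum((x - x_mean) ** 2 for x in xs))   # slope
    c0 = y_mean - c1 * x_mean                     # intercept
    print(round(c0, 3), round(c1, 3))             # close to 0 and 2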
Linear Regression -- with one input value
[Slide figure not captured in the transcript.]
Correlation
Examine the degree to which the values
for two variables behave similarly.
Correlation coefficient r:
• 1 = perfect correlation
• -1 = perfect but opposite correlation
• 0 = no correlation
Correlation
Correlation coefficient:
r = Σ(xi – X̄)(yi – Ȳ) / √( Σ(xi – X̄)² · Σ(yi – Ȳ)² )
where X̄ and Ȳ are the means of X and Y respectively.
Suppose X = (1, 3, 5, 7, 9) and Y = (9, 7, 5, 3, 1): r = ?
Suppose X = (1, 3, 5, 7, 9) and Y = (2, 4, 6, 8, 10): r = ?
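A Python sketch that answers both questions using the formula above:

    import math

    def corr(X, Y):
        """Correlation coefficient r from the formula above."""
        x_mean = sum(X) / len(X)
        y_mean = sum(Y) / len(Y)
        num = sum((x - x_mean) * (y - y_mean) for x, y in zip(X, Y))
        den = math.sqrt(sum((x - x_mean) ** 2 for x in X)
                        * sum((y - y_mean) ** 2 for y in Y))
        return num / den

    print(corr([1, 3, 5, 7, 9], [9, 7, 5, 3, 1]))    # -1.0 (perfect but opposite)
    print(corr([1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))   #  1.0 (perfect correlation)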
Similarity Measures
Determine the similarity between two objects.
Similarity characteristics:
[Slide list not captured in the transcript.]
Alternatively, distance measures measure how
unlike or dissimilar objects are.
Similarity Measures
[Slide formulas not captured in the transcript.]
Distance Measures
Measure dissimilarity between objects
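The slide's formulas are not in the transcript; as a hedged sketch, here are two commonly used distance measures (Euclidean and Manhattan), with cosine similarity for contrast:

    import math

    def euclidean(a, b):
        """Euclidean distance: sqrt of the sum of squared coordinate differences."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def manhattan(a, b):
        """Manhattan (city-block) distance: sum of absolute coordinate differences."""
        return sum(abs(x - y) for x, y in zip(a, b))

    def cosine_sim(a, b):
        """Cosine similarity: dot product over the product of vector lengths."""
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(y * y for y in b)))

    a, b = [1.0, 2.0, 3.0], [2.0, 2.0, 4.0]
    print(euclidean(a, b), manhattan(a, b), round(cosine_sim(a, b), 4))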
Next Lecture:
Data Mining techniques (II)
– Decision trees, neural networks and
genetic algorithms
Reading assignments: Chapter 3