DMML3_stats - Mathematical & Computer Sciences


Data Mining
(and machine learning)
DM Lecture 3: Basic Statistics for data miners
David Corne, and Nick Taylor, Heriot-Watt University - [email protected]
These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Overview of My Lectures
All at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
• 25/9: Overview of DM (and of these 8 lectures)
• 02/10: Data Cleaning - usually a necessary first step for large amounts of data
• 09/10: Basic Statistics for Data Miners - essential knowledge, and very useful
• 16/10: Basket Data/Association Rules (A Priori algorithm) - a classic algorithm, used much in industry
• NO THURSDAY LECTURE OCTOBER 23rd
• 30/10: Cluster Analysis and Clustering - simple algorithms that tell you much about the data
• NO THURSDAY LECTURE November 6th
• 13/11: Similarity and Correlation Measures - making sure you do clustering appropriately for the given data
• 20/11: Regression - the simplest algorithm for predicting data/class values
• 27/11: A Tour of Other Methods and their Essential Details - every important method you may learn about in future
Today you will see the most important theorem in science
Statistical Data Mining
• Definitions
– Population, Sample, Statistic
• Simple Statistics
– Mean, Mode, Median
– Range, Variance, Standard Deviation
• Probability Distributions
– Normal distribution
Fundamental Statistics Definitions
• A Population is the total collection of all items/individuals/events
under consideration
• A Sample is that part of a population which has been
observed or selected for analysis
E.g. all students form a population. Students at HWU are a
sample; this class is a sample, etc.
• A Statistic is a measure which can be computed to describe a
characteristic of the sample (e.g. the sample mean)
The reason for doing this is almost always to estimate (i.e. make a
good guess about) that characteristic in the population
E.g.
• This class is a sample from the population of students at HWU
(it can also be considered as a sample of other populations – like
what?)
• One statistic of this sample is your mean weight. Suppose that is
65 kg, i.e. this is the sample mean.
• Is 65 kg a good estimate for the mean weight of the population?
• Another statistic: suppose 10% of you are married. Is this a good
estimate for the proportion that are married in the population?
Some Simple Statistics
• The Mean (average) is the sum of the values in a sample divided by the number of values
• The Median is the midpoint of the values in a sample (50% above; 50% below) after they have been ordered (e.g. from the smallest to the largest)
• The Mode is the value that appears most frequently in a sample
• The Range is the difference between the smallest and largest values in a sample
• The Variance is a measure of the dispersion of the values in a sample - how closely the observations cluster around the mean of the sample
• The Standard Deviation is the square root of the variance of a sample
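The definitions above can be tried out with a minimal sketch using Python's standard `statistics` module. The sample values are made up for illustration (they are not from the lecture), and the variance here divides by n.

```python
import statistics

# Hypothetical sample: weights in kg (made-up values for illustration only)
sample = [58, 61, 65, 65, 67, 69, 70]

mean = statistics.mean(sample)            # sum of values / number of values
median = statistics.median(sample)        # middle value after sorting
mode = statistics.mode(sample)            # most frequent value
value_range = max(sample) - min(sample)   # largest minus smallest
variance = statistics.pvariance(sample)   # mean squared distance from the mean (divides by n)
std_dev = variance ** 0.5                 # square root of the variance

print(mean, median, mode, value_range, variance, std_dev)
```

For this sample the mean, median and mode all happen to be 65 and the range is 12.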
Standard Deviation and other 'moments'
• The m-th moment about the mean (μ) of a sample is:
$$\frac{1}{n} \sum_{x \in \text{Sample}} (x - \mu)^m$$
Where n is the number of items in the sample.
• The first moment (m = 1) is always 0
• The second moment (m = 2) is the variance
• (and: square root of the variance is the standard deviation)
• The third moment can be used in tests for skewness
• The fourth moment can be used in tests for kurtosis
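As a rough sketch of the moment formula, the function below computes the m-th moment about the mean for a made-up sample. The function name `central_moment` and the data are illustrative assumptions. Dividing the third and fourth moments by powers of the standard deviation is one common convention for skewness and kurtosis, assumed here rather than taken from the slides.

```python
def central_moment(sample, m):
    """m-th moment about the mean: (1/n) * sum((x - mean) ** m)."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** m for x in sample) / n

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values

m1 = central_moment(sample, 1)   # first moment: 0
m2 = central_moment(sample, 2)   # second moment: the variance
m3 = central_moment(sample, 3)
m4 = central_moment(sample, 4)

std = m2 ** 0.5
skewness = m3 / std ** 3   # assumed convention: standardised third moment
kurtosis = m4 / std ** 4   # assumed convention: standardised fourth moment

print(m1, m2, skewness, kurtosis)
```

Under this convention a symmetric sample has skewness near 0, and a Normal distribution has kurtosis 3.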
Distributions / Histograms
A Normal (aka Gaussian) distribution (image from Mathworld)
Distributions / Histograms
Uniform distributions. Every possible value tends to be equally likely
Probability Distributions
• If a population is expected to match a standard probability
distribution then a wealth of statistical knowledge and
results can be brought to bear on its analysis
• Many standard statistical techniques are based on the
assumption that the underlying distribution of a
population is Normal (Gaussian)
• Statistical tests have been developed to determine whether
a sampled population is normally distributed
An important aside …
This is the standard deviation of a sample:
$$\text{Std} = \sqrt{\frac{1}{n} \sum_{x \in \text{Sample}} (x - \mu)^2}$$
This is slightly different, called the sample standard deviation:
$$\text{Sample Std} = \sqrt{\frac{1}{n - 1} \sum_{x \in \text{Sample}} (x - \mu)^2}$$
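The two formulas can be compared with a small sketch. In Python's standard `statistics` module, `pstdev` divides by n and `stdev` divides by n - 1. The data values are made up for illustration.

```python
import statistics

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values

std_n = statistics.pstdev(sample)          # divides by n
std_n_minus_1 = statistics.stdev(sample)   # divides by n - 1

print(std_n, std_n_minus_1)
```

The n - 1 version is always slightly larger, and the gap shrinks as the sample grows.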
A closer look at the normal distribution
This is the Normal distribution (ND) with mean mu (μ) and std sigma (σ)
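For reference, the standard textbook form of the Normal density with mean μ and standard deviation σ is given below. The name f(x) is chosen here for illustration.

```latex
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
```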
More than just a pretty bell shape
Suppose the mean of your sample is 1.8, and suppose the std of your sample is 0.12.
Theory tells us that if a population is Normal, the sample std is a fairly good guess at the population std.
So, we can say with some confidence, for example, that 99.7% of the population lies between 1.44 and 2.16 (i.e. within 3 std of the mean: 1.8 ± 3 × 0.12).
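The 99.7% figure can be checked with a short sketch using `statistics.NormalDist` from the Python standard library. It assumes the population really is Normal with the sample's mean and std.

```python
from statistics import NormalDist

mean, std = 1.8, 0.12                    # sample statistics from the slide
dist = NormalDist(mu=mean, sigma=std)    # assumption: population is Normal with these values

low, high = mean - 3 * std, mean + 3 * std
coverage = dist.cdf(high) - dist.cdf(low)   # probability mass between low and high

print(round(low, 2), round(high, 2), round(coverage, 4))
```

Replacing the 3 with 1 or 2 gives roughly 68% and 95% coverage.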
Date       Sales     Returns   Net income
23rd Nov   £25,609   £1,003    £24,506
24th Nov   £26,202   £1,601    £24,601
25th Nov   £28,936   £1,178    £25,758
The Central Limit Theorem
Sir Francis Galton (Natural Inheritance, 1889) described the Central
Limit Theorem as:
“I know of scarcely anything so apt to impress the imagination as
the wonderful form of cosmic order expressed by the "Law of
Frequency of Error". The law would have been personified by the
Greeks and deified, if they had known of it. It reigns with serenity
and in complete self-effacement, amidst the wildest confusion. The
huger the mob, and the greater the apparent anarchy, the more
perfect is its sway. It is the supreme law of Unreason. Whenever a
large sample of chaotic elements are taken in hand and marshaled in
the order of their magnitude, an unsuspected and most beautiful
form of regularity proves to have been latent all along.”
(from the Wikipedia article)
The more tosses of the coin in each experiment, the closer the distribution of the number of heads is to a Normal distribution.
Same with:
• distribution of the sum of two dice
• distributions of heights, weights, hours watching TV, etc.
(from the Wikipedia article)
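The coin-tossing claim can be explored with a small simulation. This is only a sketch: the helper name `heads_counts`, the number of tosses and experiments, and the seed are all arbitrary choices made here, not values from the lecture.

```python
import random
from collections import Counter

def heads_counts(tosses_per_experiment, experiments, seed=0):
    """Count heads in each of many repeated coin-tossing experiments."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(rng.random() < 0.5 for _ in range(tosses_per_experiment))
        for _ in range(experiments)
    ]

counts = heads_counts(tosses_per_experiment=100, experiments=2000)
histogram = Counter(counts)

# Crude text histogram: one '#' per 10 experiments with that many heads
for heads in range(35, 66):
    print(f"{heads:3d} {'#' * (histogram[heads] // 10)}")
```

Rerunning with fewer tosses per experiment (say 5) gives a visibly lumpier, less bell-like histogram.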
The Central Limit Theorem is this:
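As more and more samples are taken from a population, the
distribution of the sample means conforms more and more closely to a
normal distribution, whatever the shape of the population's own
distribution (provided its variance is finite); the larger each
sample, the closer the fit
• The average of the sample means more and more closely
approximates the average of the entire population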
• A very powerful and useful theorem
• The normal distribution is such a common and useful
distribution that additional statistics have been developed to
measure how closely a population conforms to it and to test for
divergence from it due to skewness and kurtosis
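A minimal simulation sketch of the theorem, assuming a uniform(0, 1) population, which is flat rather than bell-shaped. The function name `sample_means`, the sample size, the number of samples and the seed are illustrative choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed, chosen arbitrarily, for repeatability

def sample_means(sample_size, num_samples):
    """Means of repeated samples drawn from a uniform(0, 1) population."""
    means = []
    for _ in range(num_samples):
        sample = [rng.random() for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return means

means = sample_means(sample_size=30, num_samples=2000)

# The sample means cluster around the population mean (0.5), with spread
# close to sigma / sqrt(n) = (1 / sqrt(12)) / sqrt(30), roughly 0.053.
print(statistics.mean(means), statistics.stdev(means))
```

A histogram of `means` looks bell-shaped even though the population itself is flat.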
Remember, MUCH of science relies on making guesses about populations.
The CLT helps us make the guesses reasonable rather than crazy.
Assuming a normal distribution, the stats of a sample tell us lots about the stats of the population.
And, assuming a normal distribution helps us detect errors and outliers - how?
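One possible answer, sketched under assumptions: if the data are roughly Normal, almost all values should lie within 3 standard deviations of the mean, so values further out are worth checking. The function name `find_outliers`, the threshold of 3 and the height data (including a deliberately misplaced decimal point) are made up for illustration.

```python
import statistics

def find_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` sample stds from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [x for x in values if abs(x - mean) > threshold * std]

# Hypothetical heights in metres; 18.0 is a simulated data-entry error for 1.80
heights = [1.62, 1.70, 1.75, 1.78, 1.80, 1.81, 1.83, 1.85, 1.90, 1.95] * 2 + [18.0]

print(find_outliers(heights))
```

Note that a bad value inflates the mean and std it is judged against, so with very small samples this simple rule can miss outliers.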
Testing for Normality:
the χ² goodness-of-fit test
This is the classic test of whether a data sample is normally
distributed or not
• We first group our data into k classes so that we can form a
frequency distribution (the number of data items in each class)
• We calculate the mean and standard deviation of our sample and
define a normal distribution based on these values.
• We now need to see if the number of data items in each of our
classes matches the number predicted by the normal distribution
The normal distribution - with mean mu and std sigma
This tells you how to calculate the probability density (relative
frequency) for any value x
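As a sketch, the Normal density can be evaluated directly from its textbook formula. The function name `normal_pdf` is chosen here, and the parameters reuse the 1.8 and 0.12 from the earlier example. The probability of falling in a range is the area under the curve, obtained here from `statistics.NormalDist.cdf`.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Height of the Normal curve at x for mean mu and std sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 1.8, 0.12
density_at_mean = normal_pdf(mu, mu, sigma)   # the peak of the curve

# Probability of a value falling in one class (bar), e.g. between 1.8 and 1.9
dist = NormalDist(mu, sigma)
prob_in_class = dist.cdf(1.9) - dist.cdf(1.8)

print(density_at_mean, prob_in_class)
```

Multiplying `prob_in_class` by the sample size gives the number of items the Normal curve predicts for that class.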
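The goodness-of-fit test simply measures the
difference between the bars (observed counts) and the curve (counts
expected under the fitted normal distribution): it adds up the squared
difference for each bar, divided by the expected count for that bar.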
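A minimal sketch of the statistic, using only the standard library. The function name `chi_square_normal`, the class boundaries and the two synthetic data sets are assumptions made for illustration. Turning the statistic into a pass/fail decision needs a χ² table or a statistics package, which is not shown.

```python
import random
from statistics import NormalDist, mean, pstdev

def chi_square_normal(data, edges):
    """Chi-square statistic comparing binned data with a fitted Normal.

    edges: sorted interior class boundaries; the classes are
    (-inf, e0], (e0, e1], ..., (e_last, +inf).
    """
    n = len(data)
    fitted = NormalDist(mean(data), pstdev(data))  # Normal with the sample's mean and std
    bounds = [float("-inf")] + list(edges) + [float("inf")]
    statistic = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        observed = sum(lo < x <= hi for x in data)          # height of the bar
        expected = n * (fitted.cdf(hi) - fitted.cdf(lo))    # count predicted by the curve
        statistic += (observed - expected) ** 2 / expected
    return statistic

rng = random.Random(1)  # fixed seed, chosen arbitrarily
normal_data = [rng.gauss(10, 2) for _ in range(1000)]
uniform_data = [rng.uniform(4, 16) for _ in range(1000)]
edges = [6, 8, 9, 10, 11, 12, 14]  # arbitrary class boundaries

normal_stat = chi_square_normal(normal_data, edges)
uniform_stat = chi_square_normal(uniform_data, edges)
print(normal_stat, uniform_stat)
```

A small statistic means the bars sit close to the curve; the uniform data should score much higher than the Normal data.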
We can also test for skewness and kurtosis, using higher order moments
The take-home lesson (for those new to statistics)
Your data contains 100 values for x, and you have good reason to believe that x is normally distributed.
Thanks to the Central Limit Theorem, you can:
– Make a lot of good estimates about the statistics of the population
– Find outliers and spot other problems in the data
It's better to test for Normality though, and also test for skewness and kurtosis, so that you can say: "probably around 0.3% of people use their mobile for >8 hrs per day, although the sample is somewhat skewed to the left so this may be an underestimate …"
Next week –
an actual Data Mining Algorithm!