DMML3_0910 - Mathematical & Computer Sciences
Data Mining
(and machine learning)
DM Lecture 3: Basic Statistics and Coursework 1
David Corne, and Nick Taylor, Heriot-Watt University - [email protected]
These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Communities and Crime
Here is an interesting dataset
-- state: US state (by number)
-- county: numeric code for county
-- community: numeric code for community
-- communityname: community name
-- fold: fold number for non-random 10 fold cross validation
-- population: population for community: (numeric - decimal)
-- householdsize: mean people per household (numeric - decimal)
-- racepctblack: percentage of population that is african american (numeric - decimal)
-- racePctWhite: percentage of population that is caucasian (numeric - decimal)
-- racePctAsian: percentage of population that is of asian heritage (numeric - decimal)
-- racePctHisp: percentage of population that is of hispanic heritage (numeric - decimal)
-- agePct12t21: percentage of population that is 12-21 in age (numeric - decimal)
-- agePct12t29: percentage of population that is 12-29 in age (numeric - decimal)
-- agePct16t24: percentage of population that is 16-24 in age (numeric - decimal)
-- agePct65up: percentage of population that is 65 and over in age (numeric - decimal)
-- numbUrban: number of people living in areas classified as urban (numeric - decimal)
-- pctUrban: percentage of people living in areas classified as urban (numeric - decimal)
-- medIncome: median household income (numeric - decimal)
-- pctWWage: percentage of households with wage or salary income in 1989 (numeric - decimal)
-- pctWFarmSelf: percentage of households with farm or self employment income in 1989
[etc etc etc --- 128 fields altogether]
-- ViolentCrimesPerPop: total number of violent crimes per 100K population (numeric - decimal)
GOAL attribute (to be predicted)
Mining the C&C data
Let’s do some basic preprocessing and mining
of these data, to start to grasp whether we
can find any patterns that will predict
certain levels of violent crime
etc … about 2,000 instances
First: some sensible preprocessing
The first 5 fields are (probably) not useful for
prediction – they are more like “ID” fields
for the record. So, let’s remove them.
There are many cases of missing data here too
– let’s remove any field which has any
missing data in it at all. This is OK for the
C&C data, still leaving 100 fields.
First: some sensible preprocessing
I downloaded the data. First I converted it from comma-separated to space-separated form, because I prefer it that way. I wrote an awk script to do this, called cs2ss.awk, here:
http://www.macs.hw.ac.uk/~dwcorne/Teaching/DMML/cs2ss.awk
I did that with this command line on a unix machine:
awk -f cs2ss.awk < communities.data > commss.txt
placing the new version in “commss.txt”.
Then, I wanted to remove the first 5 fields, and remove any field in which any record
contained missing values. I wrote an awk script for that too:
http://www.macs.hw.ac.uk/~dwcorne/Teaching/DMML/fixcommdata.awk
and did this:
awk -f fixcommdata.awk < commss.txt > commssfixed.txt
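The awk scripts themselves are not reproduced in these slides, so the following Python sketch only illustrates the same two preprocessing steps; it is not the author's script. It assumes each record has already been split into a list of strings and that a missing value is marked with "?". The function name and the toy records are hypothetical.

```python
def drop_id_and_incomplete_fields(rows, n_id_fields=5, missing="?"):
    """Remove the leading ID fields, then every field that is missing in any record."""
    # Step 1: drop the first n_id_fields columns from every record.
    trimmed = [row[n_id_fields:] for row in rows]
    # Step 2: keep only the columns with no missing marker in any record.
    n_cols = len(trimmed[0])
    keep = [c for c in range(n_cols)
            if all(row[c] != missing for row in trimmed)]
    return [[row[c] for c in keep] for row in trimmed]


# Made-up records (not real C&C data): 5 ID fields followed by 3 data fields.
rows = [
    ["8", "?", "?", "TownA", "1", "0.19", "0.33", "0.20"],
    ["53", "?", "?", "TownB", "1", "0.00", "?", "0.67"],
]
cleaned = drop_id_and_incomplete_fields(rows)
print(cleaned)
```

In the toy data the middle data field is dropped because one record has no value for it, which is the same "remove the whole field" policy described above.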
Normalisation
The fields in these data happen to be already min-max normalised into [0,1]; I wondered
whether it would also be good to z-normalise the fields. So I wrote an awk script for z-normalisation, and produced a version that had that done:
http://www.macs.hw.ac.uk/~dwcorne/Teaching/DMML/znorm.awk
awk -f znorm.awk < commssfixed.txt > commssfixedznorm.txt
In these data, the class value is numeric, between 0 and 1, indicating (already normalized)
the relative amount of violent crime in the community in question. To make it easier to
find patterns and relationships, I produced new versions of each dataset where the class
value was either 0 (low) or 1 (high) – 0 in the cases where it had been <= 0.4, and 1
otherwise. I used an awk script for that too, and did some renaming of files, and ended
up with:
• commssfixed.txt two-class
• commssfixedznorm.txt two-class z-normalised
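The thresholding step can be sketched in Python as follows. This is only an illustration of the rule stated above (0 if the class value is <= 0.4, otherwise 1); it assumes the class value is the last field of each record, and the function name and sample rows are hypothetical.

```python
def to_two_class(rows, threshold=0.4):
    """Replace the numeric class (last field) by 0 if it is <= threshold, else 1."""
    return [row[:-1] + [0 if row[-1] <= threshold else 1] for row in rows]


# Made-up records: two ordinary fields followed by the numeric class value.
rows = [[0.2, 0.7, 0.40], [0.9, 0.1, 0.41], [0.5, 0.5, 0.05]]
two_class = to_two_class(rows)
print([row[-1] for row in two_class])
```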
Now, I wonder: how good is 1-NN at
predicting the class for these data?
If only using fields 20—30 to work out the distance
values, the answer is:
Unchanged data (in this case, already min-max
normalised to [0,1]): 81.1%
Z-normalised: 81.5%
But note that 81% of the data is class 0, so if you
always guess “0”, your accuracy will be 81.0%. In other words, 1-NN on these
fields does barely better than always guessing the majority class.
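The slides do not say how the 1-NN accuracy was measured, so the sketch below assumes a leave-one-out protocol: each record is classified by its nearest other record. It also assumes Euclidean distance over the chosen fields and that the class label is the last field. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def one_nn_accuracy(rows, feature_idx):
    """Leave-one-out 1-NN accuracy; the class label is the last field of each row."""
    correct = 0
    for i, query in enumerate(rows):
        best_dist, best_label = math.inf, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # a record may not be its own neighbour
            # Squared Euclidean distance gives the same ranking as Euclidean.
            d = sum((query[k] - other[k]) ** 2 for k in feature_idx)
            if d < best_dist:
                best_dist, best_label = d, other[-1]
        correct += (best_label == query[-1])
    return correct / len(rows)


def majority_baseline(rows):
    """Accuracy obtained by always guessing the most common class."""
    counts = Counter(row[-1] for row in rows)
    return counts.most_common(1)[0][1] / len(rows)


# Made-up two-field records with a 0/1 class in the last position.
rows = [
    [0.10, 0.20, 0],
    [0.15, 0.25, 0],
    [0.12, 0.22, 0],
    [0.80, 0.90, 1],
    [0.85, 0.95, 1],
]
print(one_nn_accuracy(rows, feature_idx=[0, 1]))
print(majority_baseline(rows))
```

To restrict the distance to fields 20 to 30 of a real dataset, `feature_idx` would be `range(19, 30)` with zero-based indexing. Comparing the two printed numbers is the same check as comparing 81.1% against the 81.0% baseline.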
Now let’s look at the data in more detail;
some histograms of the fields
Here is the distribution of values in field 6 for
class 0 – it is a “5-bin” distribution.
[Figure: 5-bin histogram of field 6 for class 0; bins 1 to 5 on the x-axis, y-axis scale 0 to 0.9]
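A 5-bin distribution like this one can be computed along the following lines. This is a sketch under assumptions: the slides do not state the bin edges, so it uses five equal-width bins over [0, 1] (the range of the min-max normalised fields) and reports the proportion of values in each bin. The function name and sample values are hypothetical.

```python
def five_bin_histogram(values, n_bins=5, lo=0.0, hi=1.0):
    """Proportion of values falling into each of n_bins equal-width bins on [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        # Values equal to hi would index one past the end, so clamp to the last bin.
        b = min(int((v - lo) / width), n_bins - 1)
        counts[b] += 1
    return [c / len(values) for c in counts]


# Made-up field values for the records of one class.
class0_values = [0.05, 0.1, 0.15, 0.3, 0.9, 1.0]
hist = five_bin_histogram(class0_values)
print(hist)
```

Running the function once per class on the same field gives the pair of distributions that the class 0 / class 1 charts compare.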
Let’s look at the distributions of field 6 for class 0
and class 1 together (% of pop that is Hispanic)
[Figure: 5-bin histograms of field 6 for class 0 and class 1; bins 1 to 5 on the x-axis, y-axis scale 0 to 0.9]
Field 7 (% of pop aged 12--21)
[Figure: 5-bin histograms of field 7 for class 0 and class 1; bins 1 to 5 on the x-axis, y-axis scale 0 to 0.7]
Field 8 (% of pop aged 12—29)
[Figure: 5-bin histograms of field 8 for class 0 and class 1; bins 1 to 5 on the x-axis, y-axis scale 0 to 0.8]
Field 9 (% of pop aged 16—24)
[Figure: 5-bin histograms of field 9 for class 0 and class 1; bins 1 to 5 on the x-axis, y-axis scale 0 to 0.8]
Field 10 (% of pop aged >= 65)
[Figure: 5-bin histograms of field 10 for class 0 and class 1; bins 1 to 5 on the x-axis, y-axis scale 0 to 0.5]
[Figure: the five class 0 / class 1 histograms for fields 6 to 10, shown together on one slide]
Which two fields seem most useful for discriminating between classes 0 and 1?
Fields 6 and 7
Maybe we will get better 1-NN results using only the important fields? 2
is (most often) too small a number of fields, but anyway …
I produced versions of the dataset that had only fields 6, 7 and 100 (these
two, and the class field): I then calculated 1-NN accuracy for these.
Results:
`Unchanged’ version: fields 6 and 7: 70.8% (was 81.1%)
Z-normalised: fields 6 and 7: 70.9% (was 81.5%)
Not very successful! But I didn’t expect it to be: working with several more
of the important fields would quite possibly give better accuracies, but
may take too much time to demonstrate, or to do in your assignment.
Coursework 1
You will do what we just did, for four datasets:
For each one:
1. Download it (of course), then do some simple preparation.
2. Produce a version of the data set where each non-class field is min-max normalised (for the Communities and Crime dataset, do z-normalisation instead).
3. Convert into a two-class dataset; do this for both the original and normalised cases.
4. Calculate the accuracy of 1-nearest-neighbour classification for your dataset; do this for both original and normalised versions.
5. Generate 5-bin histograms of the distribution of the first five fields, for each of the two classes.
6. Write 100 to 200 words describing how the distributions differ between the two classes, and describing what you think are the two most important fields for discriminating between the classes.
7. Produce a reduced dataset (two versions: original and normalised) which contains only three fields: the two you considered most important, and the class field.
8. Repeat step 4, but this time for the reduced datasets.
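Min-max normalisation (step 2) is not spelled out in these slides, so here is a minimal sketch using the usual definition: each value v of a field becomes (v - min) / (max - min), which maps the field onto [0, 1]. Mapping a constant field to all zeros is an assumption made here to avoid dividing by zero; the function name is hypothetical.

```python
def min_max_normalise(column):
    """Rescale one field so that its smallest value is 0 and its largest is 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant field carries no information, so map it to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


print(min_max_normalise([2.0, 4.0, 6.0]))
print(min_max_normalise([3.0, 3.0]))
```

The function works on one field at a time, so it would be applied to each non-class column of a dataset in turn.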
What you email me
One brief paragraph that tells me how you did it / what tools you used. This
will not affect the marks – I would just like to know.
For step 4: A table that tells me the answers for each dataset, followed by a
paragraph that attempts to explain any differences in performance between
the normalised and original versions, or explains why performance is similar.
For steps 5 and 6: 1 page per dataset; on each page, the 10 histograms, and
the discussion (step 6).
For step 8: same as step 4.
That must all be done within 6 sides of A4.
You should know what Z-normalisation
is, so here is a brief lecture on basic
statistics, including that
Fundamental Statistics Definitions
• A Population is the total collection of all items/individuals/events
under consideration
• A Sample is that part of a population which has been
observed or selected for analysis
E.g. all students is a population. Students at HWU is a
sample; this class is a sample, etc …
• A Statistic is a measure which can be computed to describe a
characteristic of the sample (e.g. the sample mean)
The reason for doing this is almost always to estimate (i.e. make a
good guess about) that characteristic in the population
E.g.
• This class is a sample from the population of students at HWU
(it can also be considered as a sample of other populations – like
what?)
• One statistic of this sample is your mean weight. Suppose that is
65 kg, i.e. this is the sample mean.
• Is 65 kg a good estimate for the mean weight of the population?
• Another statistic: suppose 10% of you are married. Is this a good
estimate for the proportion that are married in the population?
Some Simple Statistics
• The Mean (average) is the sum of the values in a sample divided by the number of values
• The Median is the midpoint of the values in a sample (50% above; 50% below) after they have been ordered (e.g. from the smallest to the largest)
• The Mode is the value that appears most frequently in a sample
• The Range is the difference between the smallest and largest values in a sample
• The Variance is a measure of the dispersion of the values in a sample: how closely the observations cluster around the mean of the sample
• The Standard Deviation is the square root of the variance of a sample
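The statistics listed above can be computed with Python's standard `statistics` module, as in this small sketch. The sample values are made up. Note that `pvariance` and `pstdev` divide by n, while `statistics.variance` and `statistics.stdev` divide by n - 1, so the choice of function matters when comparing results.

```python
import statistics

sample = [2, 4, 4, 5, 10]  # made-up values

mean = statistics.mean(sample)            # sum of values / number of values
median = statistics.median(sample)        # middle value after sorting
mode = statistics.mode(sample)            # most frequent value
value_range = max(sample) - min(sample)   # largest minus smallest
variance = statistics.pvariance(sample)   # mean squared deviation (divides by n)
std = statistics.pstdev(sample)           # square root of that variance

print(mean, median, mode, value_range, variance, round(std, 3))
```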
Statistical moments
• The m-th moment about the mean (μ) of a sample is:
$$\frac{1}{n} \sum_{x \in \text{Sample}} (x - \mu)^m$$
where n is the number of items in the sample.
• The first moment (m = 1) is 0!
• The second moment (m = 2) is the variance
• (and: square root of the variance is the standard deviation)
• The third moment can be used in tests for skewness
• The fourth moment can be used in tests for kurtosis
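The moment formula translates directly into code. The sketch below follows the 1/n definition given on this slide; the function name and the sample values are hypothetical.

```python
def central_moment(sample, m):
    """m-th moment about the mean: (1/n) * sum of (x - mu)**m over the sample."""
    n = len(sample)
    mu = sum(sample) / n
    return sum((x - mu) ** m for x in sample) / n


sample = [2, 4, 4, 5, 10]  # made-up values
print(central_moment(sample, 1))          # first moment: always 0
print(central_moment(sample, 2))          # second moment: the variance
print(central_moment(sample, 2) ** 0.5)   # its square root: the standard deviation
```

The first moment is 0 because the positive and negative deviations from the mean cancel exactly.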
Variance and Standard Deviation
• The variance of a sample is the 2nd moment about the mean:
$$\text{variance} = \frac{1}{n} \sum_{x \in \text{Sample}} (x - \mu)^2$$
where n is the number of items in the sample.
• The square root of the variance is the standard deviation.
Distributions / Histograms
A Normal (aka Gaussian) distribution (image from Mathworld)
`Normal’ or Gaussian distributions …
• … tend to be everywhere
• Given a typical numeric field in a typical
dataset, it is common that most values are
centred around a particular value (the
mean), and the proportion with larger or
smaller values tends to tail off.
`Normal’ or Gaussian distributions …
• We just saw this – fields 7—10 were
Normal-ish
• Heights, weights, times (e.g. for 100m
sprint, for lengths of software projects),
measurements (e.g. length of a leaf, waist
measurement, coursework marks, level of
protein A in a blood sample, …) all tend to
be Normally distributed. Why??
Sometimes distributions are uniform
Uniform distributions. Every possible value tends to be equally likely
This figure is from: http://mathworld.wolfram.com/Dice.html
One die: uniform distribution of possible totals:
But look what happens as soon as the value is a sum of things;
The more things, the more Gaussian the distribution.
Are measurements (etc.) usually the sum of many factors?
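Since the dice figure is not reproduced here, the effect can be checked numerically. The sketch below computes the exact distribution of the total of n fair six-sided dice by repeatedly adding one die; the function name is hypothetical, and exact fractions are used so no rounding is involved.

```python
from fractions import Fraction


def dice_total_distribution(n_dice, sides=6):
    """Exact probability of each possible total when n_dice fair dice are rolled."""
    dist = {0: Fraction(1)}  # before any die is rolled the total is 0 for certain
    for _ in range(n_dice):
        new_dist = {}
        for total, p in dist.items():
            for face in range(1, sides + 1):
                # Each face is equally likely, so it adds p / sides to the new total.
                new_dist[total + face] = new_dist.get(total + face, 0) + p / sides
        dist = new_dist
    return dist


for n in (1, 2, 3):
    dist = dice_total_distribution(n)
    print(n, "dice:", {t: str(p) for t, p in sorted(dist.items())})
```

With one die every total has probability 1/6; with two or three dice the middle totals become the most likely and the extremes the rarest, which is the bell shape starting to appear.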
Probability Distributions
• If a population (e.g. field of a dataset) is expected
to match a standard probability distribution then a
wealth of statistical knowledge and results can be
brought to bear on its analysis
• Many standard statistical techniques are based
on the assumption that the underlying
distribution of a population is Normal
(Gaussian)
• Usually this assumption works fine
A closer look at the normal distribution
This is the normal distribution with mean μ (mu) and standard deviation σ (sigma)
More than just a pretty bell shape
Suppose the mean of your sample is 1.8, and suppose the std of your sample is 0.12.
Theory tells us that if a population is Normal, the sample std is a fairly good guess at the population std.
So, we can say with some confidence, for example, that 99.7% of the population lies within three standard deviations of the mean, i.e. between 1.8 - 3 × 0.12 = 1.44 and 1.8 + 3 × 0.12 = 2.16
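The numbers on this slide can be reproduced with a few lines of Python. This sketch assumes the population really is Normal with the given mean and standard deviation; the function names are hypothetical. The coverage uses the standard identity that the fraction of a Normal population within k standard deviations of the mean is erf(k / sqrt(2)).

```python
import math


def sigma_interval(mean, std, k=3):
    """Interval from k standard deviations below the mean to k above it."""
    return mean - k * std, mean + k * std


def normal_coverage(k):
    """Fraction of a Normal population within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


low, high = sigma_interval(1.8, 0.12)
print(round(low, 2), round(high, 2), round(normal_coverage(3), 4))
```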
Remember, MUCH of science relies on
making guesses about populations
The Central Limit Theorem (CLT) helps us make the guesses reasonable rather than crazy.
Assuming a normal distribution, the stats of a sample tell us lots about the stats of the population
And, assuming normal dist helps us detect errors and outliers – how?
Z-normalisation (or z-score
normalisation)
Given any collection of numbers (e.g. the
values of a particular field in a dataset) we
can work out the mean and the standard
deviation.
Z-score normalisation means converting the
numbers into units of standard deviation.
Simple z-normalisation example
values: 2.8, 17.6, 4.1, 12.7, 3.5, 11.8, 12.2, 11.1, 15.8, 19.6
Simple z-normalisation example
values: 2.8, 17.6, 4.1, 12.7, 3.5, 11.8, 12.2, 11.1, 15.8, 19.6
Mean: 11.12
STD: 5.93
Simple z-normalisation example
values    Mean subtracted
2.8       -8.32
17.6       6.48
4.1       -7.02
12.7       1.58
3.5       -7.62
11.8       0.68
12.2       1.08
11.1      -0.02
15.8       4.68
19.6       8.48
Mean: 11.12
STD: 5.93
Subtract the mean, so that these are centred around zero.
Simple z-normalisation example
values    Mean subtracted    In Z units
2.8       -8.32              -1.403
17.6       6.48               1.092
4.1       -7.02              -1.18
12.7       1.58               0.27
3.5       -7.62              -1.28
11.8       0.68               0.11
12.2       1.08               0.18
11.1      -0.02              -0.003
15.8       4.68               0.79
19.6       8.48               1.43
Mean: 11.12
STD: 5.93
Subtract the mean, so that these are centred around zero.
Divide each mean-subtracted value by the std; we now see how usual or unusual each value is.
(The STD of 5.93 is the value obtained with an n - 1 divisor; dividing by n instead gives about 5.63.)
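The worked example can be reproduced in Python as a check. This sketch uses `statistics.stdev`, which divides by n - 1; that choice is an assumption made here because it is the one that reproduces the STD of 5.93 shown on the slide. The function name is hypothetical.

```python
import statistics


def z_normalise(values):
    """Convert values into units of standard deviation from their mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # n - 1 divisor, matching STD 5.93 on the slide
    return [(v - mean) / std for v in values]


values = [2.8, 17.6, 4.1, 12.7, 3.5, 11.8, 12.2, 11.1, 15.8, 19.6]
z_values = z_normalise(values)

print(round(statistics.mean(values), 2), round(statistics.stdev(values), 2))
print([round(z, 3) for z in z_values])
```

After z-normalisation the values have mean 0 and standard deviation 1, so a value such as 1.43 reads directly as "1.43 standard deviations above the mean".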
The take-home lesson (for those
new to statistics)
Your data contains 100 values for x, and you have
good reason to believe that x is normally
distributed.
Thanks to the Central Limit Theorem, you can:
– Make a lot of good estimates about the statistics of the
population
– Make justified conclusions about two distributions being
different (e.g. the distribution of field X for class 1, and
the distribution of field X for class 2)
– Maybe find outliers and spot other problems in the data
Next week –
back to baskets -- a classic Data
Mining Algorithm!
The Central Limit Theorem is this:
As the size of the samples taken from a population grows, the
distribution of the sample means approaches a normal
distribution, whatever the shape of the population's own distribution
(provided it has a finite variance)
• The average of a sample more and more closely
approximates the average of the entire population as the sample grows
• A very powerful and useful theorem
• The normal distribution is such a common and useful
distribution that additional statistics have been developed to
measure how closely a population conforms to it and to test for
divergence from it due to skewness and kurtosis