Transcript slides
G54DMT – Data Mining Techniques and
Applications
http://www.cs.nott.ac.uk/~jqb/G54DMT
Dr. Jaume Bacardit
[email protected]
Topic 1: Preliminaries
Outline of the topic
• General data mining definitions and concepts
• Handling datasets
• Repositories of datasets
• Experimental evaluation of data mining methods
• Information theory
• Playing a bit with Weka
Dataset structure
• In most cases we will treat a dataset as a table with rows and columns
(Figure: a table whose columns are the attributes and whose rows are the instances)
Instances and attributes
• Instances are the atomic elements of
information from a dataset
– Also known as records or prototypes
• Each instance is composed of a certain
number of attributes
– Also known as features or variables
– In a dataset attributes can be of different types
• Continuous (or Integer)
• Discrete (i.e. being able to take a value from a finite set)
Supervised learning
• Often the dataset has a special attribute, the class or output
• If it does, the task of the data mining process consists of generating a model that can predict the class/output of a new instance based on the values of the rest of its attributes
• In order to generate this model, we will use a corpus of data for which we already know the answer, the training set
Example of dataset
Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)
Process of supervised learning
(Diagram: the Training Set is fed to the Learning Algorithm, which produces Models; a New Instance is passed, together with the Models, to the Inference Engine, which outputs an Annotated Instance)
Types of supervised learning
• If the special attribute is discrete
– We call it class
– The dataset is a classification problem
• If the special attribute is continuous
– We call it output
– The dataset is a regression problem
• Also called modelling or function approximation
Rule Learning
• CN2, RISE, GAssist, BioHEL
(Figure: a rule set partitioning the unit square of attributes X and Y: "If (X<0.25 and Y>0.75) or (X>0.75 and Y<0.25) then …", "If (X>0.75 and Y>0.75) then …", "If (X<0.25 and Y<0.25) then …", plus a default rule covering everything else)
A decision tree (ID3/C4.5)
(Figure: decision tree that splits first on age?; the <=30 branch tests student?, the 31..40 branch is a leaf predicting yes, and the >40 branch tests credit rating? with excellent and fair branches)
Linear Classification (Logistic Regression)
(Figure: data points of two classes, marked x and o, separated by a straight decision boundary)
Support Vector Machines (SVM)
(Figure: two linear separators, one with a small margin and one with a large margin; the support vectors are the instances closest to the decision boundary)
Unsupervised learning
• When we do not have (or do not take into account) the class/output attribute
• If the goal is to identify aggregations of
instances
– Clustering problem
• If the goal is to detect strong patterns in the
data
– Association rules/Itemset mining
Clustering (K-Means, EM)
Partitional clustering
http://www.mathworks.com/matlabcentral/fx_files/19344/1/k_means.jpg
Hierarchical clustering
http://www.scsb.utmb.edu/faculty/luxon.htm
Association rules mining (Apriori,
FP-growth)
Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)
Reinforcement learning
• When the system is given a reward or punishment depending on whether its prediction was correct or not
• The DM system is not told exactly what is right or wrong, just the reward
• RL methods work prediction by prediction
– This is why they are often called online systems
Dataset handling and format
• In DM most datasets are represented as simple plain text files (or sometimes Excel sheets)
• More sophisticated (and efficient) methods
exist
• We need to decide how to specify the dataset
structure (i.e. the metadata) and the content
(the instances)
ARFF format
• ARFF = Attribute-Relation File Format
• File format from the WEKA DM package
• http://www.cs.waikato.ac.nz/~ml/weka/arff.html
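As an illustration (the relation and attribute names are made up, not taken from the WEKA documentation), a minimal ARFF file first declares the metadata and then lists the instances:
% toy dataset in ARFF format (illustrative only)
@relation weather_toy
@attribute temperature numeric
@attribute outlook {sunny, overcast, rainy}
@attribute class {yes, no}
@data
27.5, sunny, no
18.0, overcast, yes
21.3, rainy, yes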
HDF5
• Much more complex file format designed for
scientific data handling
• It can store heterogeneous and hierarchically organized data
• It has been designed for efficiency
• Presentation slides
Repositories of datasets
• UCI repository
– http://archive.ics.uci.edu/ml/
– Probably the most famous collection of datasets in
ML!
– Currently has 235 datasets
• Kaggle
– It is not a static repository of datasets, but a site
that manages Data Mining competitions
– Example of the modern concept of crowdsourcing
Repositories of datasets
• KDNuggets
– http://www.kdnuggets.com/datasets/
• PSPbenchmarks
– http://www.infobiotic.net/PSPbenchmarks/
– Datasets derived from Protein Structure
Prediction problems
– Interesting benchmarks because they can be
parametrised in a very large variety of ways
• Pascal Large Scale Learning Challenge
– http://largescale.ml.tu-berlin.de/about/
Experimental Evaluation
• We have Data Mining methods and datasets
• How can we know that the patterns they
extract are meaningful?
• How can we know how well they perform and which method is the best?
• We need to follow a principled protocol to
make sure that the results and conclusions we
extract are sound
Verifying that the model is sound
• The dataset, before the data mining process, is partitioned into two non-overlapping parts:
– Training set. Will be used to generate the model
– Test set. Will be used to validate the model
• We can compute the performance metric (more on the next slides) on both sets. If the metric for the training set is much higher than for the test set, we have a case of overlearning (overfitting)
• That is, the DM method has not modelled the problem, only the training set
Performance metrics: Classification
Problems
• Accuracy: C/D
– C = number of correctly classified examples; D = size of the instance set
– Simplest and most widespread metric
– But what if some classes have more examples than others? The majority class would get more benefit out of this metric
Performance metrics: classification
problems
• Cohen’s Kappa
– Computes the agreement between two distributions of
categorical variables (i.e. real and predicted classes)
– Takes into account agreement by chance
– Hence, it may be more suitable for multi-class problems
(Garcia et al., 2009)
C = number of classes, x_i = number of examples of class i, x_ii = number of examples of class i correctly classified
• There are more metrics (e.g. ROC curves)
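As a rough sketch (not from the slides; the label vectors below are made up), both accuracy and Cohen's kappa can be computed in R from a confusion matrix of real vs. predicted classes:
# hypothetical vectors of real and predicted class labels
real <- factor(c("a","a","a","b","b","b","c","c"))
pred <- factor(c("a","a","b","b","b","c","c","c"))
cm   <- table(real, pred)                            # confusion matrix
acc  <- sum(diag(cm)) / sum(cm)                      # accuracy = C / D
p_o  <- acc                                          # observed agreement
p_e  <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2   # agreement by chance
kappa <- (p_o - p_e) / (1 - p_e)                     # Cohen's kappa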
Performance metrics
• Regression problems
– RMSD (root mean squared deviation between the predicted and the real outputs)
• Metrics for clustering problems
• Association rule mining
– Support (percentage of instances covered by the
pattern)
– Confidence (proportion of the instances covered by the antecedent of the rule that also satisfy its consequent)
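A minimal sketch of two of these metrics in R (all vectors below are made up):
# RMSD between hypothetical real and predicted outputs
real <- c(1.0, 2.5, 3.2, 4.8)
pred <- c(1.1, 2.3, 3.0, 5.0)
rmsd <- sqrt(mean((pred - real)^2))
# support and confidence of a rule A -> B over hypothetical 0/1 item indicators
A <- c(1, 1, 0, 1)
B <- c(1, 0, 0, 1)
support    <- mean(A == 1 & B == 1)          # fraction of instances covered by the pattern
confidence <- sum(A == 1 & B == 1) / sum(A == 1)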
Performance estimation
methodologies
• We have to partition the dataset into training
and test, but how is this partitioning done?
• If this partitioning is wrong we will not obtain
a good estimation of the method’s
performance
Performance estimation methodologies: Holdout
• Simply divide the dataset into two non-overlapping sets (e.g. 2/3 of the dataset for training, 1/3 of the set for test)
• The simplest method and computationally cheap
• The performance is computed on the test set
• Its reliability greatly depends on how the sets
are partitioned
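A minimal holdout sketch in R (the data frame below is made up):
set.seed(1)
dataset <- data.frame(x = rnorm(30), y = rnorm(30))   # hypothetical dataset
idx   <- sample(nrow(dataset), size = round(2/3 * nrow(dataset)))
train <- dataset[idx, ]    # 2/3 for training
test  <- dataset[-idx, ]   # remaining 1/3 for test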
Performance estimation methodologies: K-fold cross-validation
• Divide the dataset into K folds
• Iterate K times:
– Use folds 1..K-1 for training and fold K for test
– Use folds 1..K-2,K for training and fold K-1 for test
– etc.
• Compute the performance estimation as the average of the metric over the K test folds
• More robust estimator, but computationally more costly
• If each of the K folds has the same class distribution as the whole dataset, this method is called stratified K-fold cross-validation
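A minimal sketch of the K-fold loop in R (the dataset and K are made up):
set.seed(1)
dataset <- data.frame(x = rnorm(30), y = rnorm(30))    # hypothetical dataset
K     <- 10
folds <- sample(rep(1:K, length.out = nrow(dataset)))  # random fold assignment
for (k in 1:K) {
  train <- dataset[folds != k, ]
  test  <- dataset[folds == k, ]
  # train the method on 'train', evaluate it on 'test', store the metric
}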
Performance estimation methodologies
• Leave-one-out cross-validation
– Extreme case of cross-validation, where K=D, the size of
the dataset
– Training sets will have size D-1 and test sets will have only
one instance
– Mostly used in datasets of small size
• Bootstrap
– Generate a sample with replacement from the dataset
with size D. Use it as training set
– Use the instances never selected as test set
– Repeat this process a large number of times
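One bootstrap repetition could be sketched in R like this (again with a made-up dataset):
set.seed(1)
dataset <- data.frame(x = rnorm(30), y = rnorm(30))   # hypothetical dataset
idx   <- sample(nrow(dataset), replace = TRUE)        # sample of size D with replacement
train <- dataset[idx, ]
test  <- dataset[-unique(idx), ]                      # instances never selected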
Performance estimation
methodologies
• But which method is best?
• Two criteria to take into account
– Bias: Difference between the estimation and the
true error
– Variance of the estimation across different samples
• Formal and experimental analysis of some of
these methods
• Experimental study for microarray data
Who is the best?
• At the end of the experimentation we will have tested N methods on D datasets, thus we will have an NxD table of performance metrics
• How can we identify the best method?
– Highest average performance across the D datasets?
– Highest average rank across the D datasets?
• Also, is the best method significantly better than the others?
• For this we need statistical tests (next slide)
• Best study nowadays on statistical tests
• Most of the tests described in the next slides can be found in many statistical packages (e.g. R)
Statistical tests
• Procedure for making decisions about data
• Each test defines a hypothesis (H0) about the observed data
– “The accuracy differences I observe between A and B are
just by pure chance and the methods perform equally”
– Or in a more statistical way “The two sets of data
observations A and B belong to the same distribution”
• Then, the probability of observing differences at least as large as these, assuming H0 is true, is calculated
• If this probability (the p-value) is smaller than a certain threshold, H0 is rejected and the observed differences are considered statistically significant
Traditional approach: Student's t-test
• Tests whether the difference between two distributions is significant or not
• Generates a p-value, in this case the probability of observing such a difference if the two samples actually came from the same distribution
• P-values of 0.05 or 0.01 are typical thresholds
• Can only be applied if the distributions are normal and have the same variance. These conditions hardly ever hold
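For instance, a paired t-test between two methods evaluated on the same datasets could be run in R as follows (the accuracy vectors are made up):
accA <- c(0.81, 0.75, 0.69, 0.90, 0.66)   # hypothetical accuracies of method A
accB <- c(0.78, 0.74, 0.70, 0.85, 0.61)   # hypothetical accuracies of method B
t.test(accA, accB, paired = TRUE)         # p-value for H0: equal means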
Wilcoxon test
• A non-parametric test, which does not require the distributions to be normal
• It ranks the absolute differences in
performance, dataset by dataset
• Next, it sums the ranks separately for the
positive and negative differences
• A p-value will be generated depending on
which sum of ranks is smaller and the number
of datasets
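The equivalent paired (signed-rank) Wilcoxon test in R, with the same made-up accuracies:
accA <- c(0.81, 0.75, 0.69, 0.90, 0.66)   # hypothetical accuracies of method A
accB <- c(0.78, 0.74, 0.70, 0.85, 0.61)   # hypothetical accuracies of method B
wilcox.test(accA, accB, paired = TRUE)    # Wilcoxon signed-rank test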
Multiple pair-wise comparisons?
• Both the t-test and the Wilcoxon test are used
to compare two methods
• What if we have more than two methods in
our comparison?
– You can hold individually that A is better than B and that A is better than C with 95% confidence, but can you claim that it is better than both B and C at the same time with that confidence?
– The p-values do not hold anymore
• We need to apply a correction for multiple
tests (e.g. Bonferroni, Holm, etc.)
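In R these corrections can be applied to a vector of raw p-values with p.adjust (the p-values below are made up):
p_raw <- c(0.01, 0.03, 0.04)              # hypothetical raw p-values
p.adjust(p_raw, method = "bonferroni")
p.adjust(p_raw, method = "holm")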
The Friedman test
• Designed explicitly to compare multiple methods
• Based on the average rank of each method across
the datasets
• This test just says if the performance of the methods
included is similar or not
• Can be used when there are more than 10 datasets and more than 5 methods
• Once the test has determined that there are
significant performance differences, a post-hoc test is
used to spot them
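A minimal sketch using base R's friedman.test (the methods, datasets and metric values below are made up; the fried+holm.r script used later in these slides also runs the post-hoc part):
results <- data.frame(
  method  = rep(c("A", "B", "C"), each = 4),    # hypothetical methods
  dataset = rep(paste0("d", 1:4), times = 3),   # hypothetical datasets
  metric  = c(0.80, 0.75, 0.90, 0.66,
              0.78, 0.74, 0.85, 0.61,
              0.70, 0.65, 0.80, 0.55)
)
friedman.test(metric ~ method | dataset, data = results)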
Two types of post-hoc test
• Comparing every method to each other (e.g.
the Nemenyi test)
• Comparing a control method against the
others (e.g. the Holm test)
– E.g. the best method against the others
– This latter kind of test gives less information but is also more powerful (i.e. less conservative)
• Both are based on ranks
Using R to compute statistical tests
• R is an open source statistical computing
package
• It is extremely powerful, but not the simplest tool in the world to use….
• It has built-in functions for most of the tests described in the previous slides
• Example from the command line….
Data for the test
• A plain text file with rows consisting of:
– <method> <dataset> <performance metric>
• Example file
• CV has already been computed
• Let’s load the data in R:
$R
> data<-read.table("rchA.dat")
> data
     V1        V2       V3
1 C4_5 rchA_1_q2 0.664884
2 C4_5 rchA_2_q2 0.690657
3 C4_5 rchA_3_q2 0.731942
4 C4_5 rchA_4_q2 0.756022
5 C4_5 rchA_5_q2 0.733984
T-test
> pairwise.t.test(data$V3,data$V1,pool.sd=F,paired=T)
Pairwise comparisons using t tests with non-pooled SD
data: data$V3 and data$V1
     C4_5    HEL+    LCS+
HEL+ 0.00056 -       -
LCS+ 0.00038 0.53291 -
NB   0.53291 0.52834 0.53291
P value adjustment method: holm
Wilcoxon test
> pairwise.wilcox.test(data$V3,data$V1,paired=T)
Pairwise comparisons using Wilcoxon rank sum test
data: data$V3 and data$V1
     C4_5    HEL+    LCS+
HEL+ 0.00038 -       -
LCS+ 0.00032 0.72586 -
NB   0.72586 0.72586 0.72586
P value adjustment method: holm
> pairwise.wilcox.test(data$V3,data$V1,paired=T)$p.value
            C4_5      HEL+      LCS+
HEL+ 0.0003814697        NA        NA
LCS+ 0.0003204346 0.7258606        NA
NB   0.7258605957 0.7258606 0.7258606
Saving the results to a file
> write.table(pairwise.wilcox.test(data$V3,data$V1,paired=T)$p.value,"wilcox.dat")
Friedman and post hoc tests
• Code for the Holm and Nemenyi post hoc tests
• Code for accuracy (between 0 and 1)
> source("fried+holm.r”)
> fried.holm("rchA.dat")
Friedman rank sum test
data: data$V3, data$V1 and data$V2
Friedman chi-squared = 22.4667, df = 3, p-value = 5.216e-05
[1] "Average ranks"
    C4_5     HEL+     LCS+       NB
3.666667 1.722222 2.166667 2.444444
[1] "Control is HEL+"
[1] "Confidence level of 0.950000"
[1] "HEL+ is significantly better than C4_5"
[1] "Confidence level of 0.990000"
[1] "HEL+ is significantly better than C4_5"
Evaluation pipeline (summary)
• N datasets, M methods
– Method can mean anything (mining, preprocessing, etc.)
• For each of the N datasets
– Partition it into training and test sets
– For each of the K pairs of (Training_kn, Test_kn) sets
• For each of the M methods
– model = Train method M using set Training_kn
– Metric[N][M][K] = Test model using set Test_kn
– Metric[N][M] = Average across the K pairs
• Compute average ranks of each method across datasets
• Run statistical tests using Metric matrix
– Are there overall statistically significant performance differences?
– Individual tests one-vs-rest or all-vs-all
Basic Information Theory
• Shannon’s Entropy
– Measure of the information content in discrete
data
– Metric inspired by statistical information
transmission
• "How many bits per symbol (on average) would it take to transmit a message, given the frequencies of each symbol?"
H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)
Shannon’s Entropy
• If we are tossing a coin and want to send a
sequence of tosses
– If heads and tails have the same probability it
would take one bit per toss
– If heads have a very small probability compared to
tails we could just send a list of the heads and say
that everything else is a tail
• Each head message would be long, but there would be very few of them, so overall it would take less than one bit per symbol (on average)
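A small R sketch of this calculation (my own illustration, not from the slides):
entropy <- function(p) {          # p: vector of symbol probabilities
  p <- p[p > 0]                   # avoid log(0)
  -sum(p * log2(p))
}
entropy(c(0.5, 0.5))              # fair coin: 1 bit per toss
entropy(c(0.01, 0.99))            # very biased coin: about 0.08 bits per toss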
Other information theory metrics
• Conditional Entropy
– Entropy of a certain variable given that we know all about
a related variable
H(Y|X) = \sum_{x \in X} p(x) H(Y|X = x)
• Information gain
– How many bits would I save from transmitting Y if I know
about X?
– IG(Y|X) = H(Y) – H(Y|X)
• Mutual information
– Assesses the interrelationship between two variables
I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \left( \frac{p(x,y)}{p(x)\,p(y)} \right)
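A rough R illustration of H(Y), H(Y|X) and IG(Y|X) for two made-up discrete variables:
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
X <- c("a", "a", "a", "b", "b", "b")      # hypothetical discrete variable
Y <- c(1, 1, 0, 0, 0, 0)                  # hypothetical discrete variable
HY  <- entropy(table(Y) / length(Y))      # H(Y)
HYx <- sapply(unique(X), function(v) entropy(table(Y[X == v]) / sum(X == v)))
pX  <- table(X)[unique(X)] / length(X)
HY_given_X <- sum(pX * HYx)               # H(Y|X) = sum_x p(x) H(Y|X=x)
IG <- HY - HY_given_X                     # information gain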
Why is it useful in data mining?
• We have a continuous variable where each data point is associated with a class
– 10 red dots, 10 blue dots
– H(X) = -(0.5 log2(0.5) + 0.5 log2(0.5)) = 1
• What if we split this variable in two?
– Where to split?
– The point where we obtain the maximum IG. The conditioning variable is the splitting point
IG(X; cutp) = H(X) - \left( \frac{|X_{left}|}{|X|} H(X;left) + \frac{|X_{right}|}{|X|} H(X;right) \right)
Different cut points
• H(X;left) = 0.469, H(X;right) = 0.469, IG=0.531
• H(X;left) = 0, H(X;right) = 0.863, IG=0.384
• H(X;left) = 0.863, H(X;right)=0, IG=0.384
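A sketch in R of the IG of one candidate cut point (the attribute values and class labels below are made up, not the ones in the figure):
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}
x     <- c(0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9)            # continuous attribute
class <- c("red","red","red","red","blue","blue","blue","red")
cutp  <- 0.5
left  <- class[x <= cutp]
right <- class[x > cutp]
IG <- entropy(class) -
      (length(left)  / length(class)) * entropy(left) -
      (length(right) / length(class)) * entropy(right)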
Let’s play a bit with Weka
• Download the WEKA GUI from here
• Dataset for this demo
• What I am going to show is just some of the steps from this tutorial
Weka from the command line….
$ java weka.classifiers.trees.J48 -t dataset.arff

J48 pruned tree
------------------
att16 <= 1.28221
| att1 <= 1.33404
| | att3 <= 1.37681: 3 (6.0)
| | att3 > 1.37681: 2 (4.0/1.0)
| att1 > 1.33404
| | att17 <= 1.52505
| | | att5 <= 1.34135: 3 (3.0)
| | | att5 > 1.34135
| | | | att8 <= 1.3391: 3 (4.0/1.0)
| | | | att8 > 1.3391: 2 (19.0/1.0)
| | att17 > 1.52505: 2 (18.0)
att16 > 1.28221: 3 (938.0/456.0)

Number of Leaves : 7
Size of the tree : 13

Time taken to build model: 0.11 seconds
Time taken to test model on training data: 0.05 seconds

=== Error on training data ===

Correctly Classified Instances       533               53.7298 %
Incorrectly Classified Instances     459               46.2702 %
Kappa statistic                        0.0746
Mean absolute error                    0.4774
Root mean squared error                0.4885
Relative absolute error               95.4706 %
Root relative squared error           97.7091 %
Total Number of Instances            992

=== Confusion Matrix ===

   a   b   <-- classified as
  39 457 |  a = 2
   2 494 |  b = 3

=== Stratified cross-validation ===

Correctly Classified Instances       491               49.496 %
Incorrectly Classified Instances     501               50.504 %
Kappa statistic                       -0.0101
Mean absolute error                    0.4992
Root mean squared error                0.5111
Relative absolute error               99.8304 %
Root relative squared error          102.214 %
Total Number of Instances            992

=== Confusion Matrix ===

   a   b   <-- classified as
 141 355 |  a = 2
 146 350 |  b = 3
KEEL
• Knowledge Extraction using Evolutionary Learning
• Another data mining platform with integrated
graphical design of experiments
• Download prototype
• Manual
• Instructions about integrating new methods
into it (slides,paper)
Questions?