CSI5388 Practical Recommendations
Context for our Recommendations I
This discussion will take place in the context
of the following three questions:
• I have created a new classifier for a specific
problem. How does it compare to other existing
classifiers on this particular problem?
• I have designed a new classifier. How does it
compare to existing classifiers on benchmark
data?
• How do various classifiers fare on benchmark
data or on a single new problem?
Context for our Recommendations II
These three questions translate into four
different situations:
• Situation 1: Comparison of a new classifier
to generic ones for a specific problem
• Situation 2: Comparison of a new classifier
to generic ones on generic problems
• Situation 3: Comparison of generic
classifiers on generic domains
• Situation 4: Comparison of generic
classifiers on a specific problem
Selecting learning algorithms I
The general strategy is to try to select classifiers
that are more likely to succeed on the task at hand.
Situation 1: Select generic classifiers with a good
chance of success at the particular task.
• E.g., for a high-dimensionality problem: use an
SVM as a generic classifier
• E.g., for a class-imbalance problem: pair a generic
classifier with SMOTE (an oversampling technique), etc.
Situation 2: Different from Situation 1 in that no
specific problem is targeted. So, choose generic
classifiers that are generally accurate and stable
across domains (see the sketch below).
• E.g., Random Forests, SVMs, Bagging
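As a rough illustration of this situation, the sketch below runs a few such generic classifiers through 10-fold cross-validation on a single dataset. The dataset and the default hyperparameters are illustrative assumptions, not recommendations from the slides.

```python
# A minimal sketch: comparing generally accurate, stable generic
# classifiers on one dataset. Dataset and settings are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Bagging": BaggingClassifier(random_state=0),
}

for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```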
Selecting learning algorithms II
Situation 3: Different from Situations 1 and 2.
This time, we are interested in finding the
strengths and weaknesses of various algorithms
on different problems. So, select various well-known
and well-used algorithms, not necessarily
the best algorithms overall.
• E.g., Decision Trees, Neural Networks, Naïve
Bayes, Nearest Neighbours, SVMs, etc.
Situation 4: Reduces either to Situation 1, where
what matters is the search for an optimal classifier,
or to Situation 3, where the purpose is of a more
general, scientific nature.
Selecting Data Sets I
The selection of data sets differs between Situations
1 and 4 on the one hand and Situations 2 and 3 on the other.
Situations 1 and 4: We distinguish between two
cases:
• Case 1: There is just one data set of interest:
just use this data set.
• Case 2: We are considering a class of data sets
(e.g., data sets for text categorization). In this
case, we should look at Situations 2 and 3,
since data sets in the same class can have
different characteristics (e.g., noise, class
imbalance, etc.). The only difference is that the
domains in this class will be more closely
related than those in a wider study of the kind
considered in Situations 2 and 3.
Selecting Data Sets II
Situations 2 and 3: The first thing that we need to do
is determine what the exact purpose of the study is.
• Case 1: To test a specific characteristic of a new
algorithm or of various algorithms (e.g., their
resilience to noise): select domains presenting the
same characteristics.
• Case 2: To test the general performance of a new
algorithm or of various algorithms on a variety of
domains with different characteristics: select
varied domains, but watch the way in which you
report the results. There may be a lot of variance
from classifier to classifier and from one type of
domain to another. It will be best to cluster the kinds
of domains on which classifiers excel or do poorly
and report the results on a cluster-by-cluster basis.
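A minimal sketch of cluster-by-cluster reporting, assuming per-domain results have already been collected; the domain clusters here are hand-assigned illustrative labels, not the output of any prescribed clustering method:

```python
# A minimal sketch: report performance per domain cluster rather than
# as one global average. All values are made-up placeholders.
import pandas as pd

results = pd.DataFrame({
    "domain":     ["spam", "news", "mri", "xray", "fraud"],
    "cluster":    ["text", "text", "image", "image", "tabular"],
    "classifier": ["SVM"] * 5,
    "accuracy":   [0.93, 0.88, 0.75, 0.71, 0.90],
})

# Mean and spread per cluster expose where the classifier excels or fails.
print(results.groupby("cluster")["accuracy"].agg(["mean", "std"]))
```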
Selecting Data Sets III
Situations 2 and 3 (Cont’d): Three
questions remain:
• Question 1: How many data sets are
necessary / desirable?
• Question 2: Where can we get these
data sets?
• Question 3: How do we select data sets
from those available?
Selecting Data Sets IV
Situations 2 & 3: How many data sets?
The number of domains necessary depends on the
variance in the performance of the classifiers. As a
rule of thumb, 3 to 5 domains within the same
category of domains are desirable to begin with.
Note: As domains get added, the question raised by
[Salzberg, 1997] and [Jensen, 2001] regarding the
multiplicity effect should be considered.
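The slides do not prescribe a particular correction, but a Bonferroni-style adjustment is one standard way to account for the multiplicity effect as domains, and hence hypothesis tests, accumulate. A minimal sketch with made-up p-values:

```python
# A minimal sketch of a Bonferroni correction: as more domains are
# added, each individual test must clear a stricter threshold.
# The p-values below are illustrative placeholders.
alpha = 0.05
p_values = [0.01, 0.04, 0.03, 0.20]   # one test per domain (made up)

adjusted_alpha = alpha / len(p_values)  # Bonferroni-adjusted threshold
for i, p in enumerate(p_values):
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"domain {i}: p = {p:.2f} -> {verdict} at alpha = {adjusted_alpha:.4f}")
```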
Situations 2 & 3: Where can we get these data sets?
UCI Repository for machine learning or other
repositories (but the collections may not be
representative of reality).
Directly from the Web (but gathering a cleaning a
data collection is extremely time consuming)
Artificial data sets (easy to build, unlimited in size,
but too far removed from reality)
Real-world inspired artificial data (real-world data
sets artificially augmented. Easy to build, closer to 9
reality)
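A minimal sketch of the last option: augment a real data set with jittered copies of its own examples. The dataset and the noise scale are illustrative assumptions:

```python
# A minimal sketch of "real-world inspired" artificial data: add
# Gaussian jitter, scaled per feature, to copies of real examples.
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Jitter proportional to each feature's standard deviation (5% here).
noise = rng.normal(scale=0.05 * X.std(axis=0), size=X.shape)
X_augmented = np.vstack([X, X + noise])
y_augmented = np.concatenate([y, y])

print(X.shape, "->", X_augmented.shape)
```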
Selecting Data Sets V
Situations 2 & 3: How do we select data sets from
those available?
Select all those that are available and meet the
constraints of the algorithms under study.
For example, the UCI repository contains many data
sets, but only a subset of these are multi-class, only
a subset has exclusively nominal attributes, only a
subset has no missing values, and so on.
In order to increase the number of domains
available for use by researchers or practitioners of
Data Mining, some amendments to the data sets can
be made to make as many data sets as possible
conform to the requirements of the classifiers.
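A minimal sketch of such amendments, assuming a data set with missing values and a nominal attribute: impute the gaps and one-hot encode the nominal column so that more classifiers can consume the data. Column names and imputation strategies are illustrative:

```python
# A minimal sketch: "amend" a data set (impute + encode) so it conforms
# to the requirements of more classifiers. Data is a tiny placeholder.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "age":    [25, None, 40],
    "colour": ["red", "blue", None],
})

amend = ColumnTransformer([
    ("numeric", SimpleImputer(strategy="median"), ["age"]),
    ("nominal", make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(handle_unknown="ignore", sparse_output=False),
    ), ["colour"]),
])

print(amend.fit_transform(df))
```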
Selecting performance measures
Situations 2 and 3: Caruana and Niculescu-Mizil (2004)
suggest that the root mean squared error (RMSE) is the
best general-purpose measure, since it is the one
best correlated with the other eight measures that
they use. Researchers are, however, encouraged to
use a variety of different metrics in order to discover
the various strengths and shortcomings of each
classifier and each domain more specifically.
Situations 1 and 4: We distinguish between the following
cases:
• Balanced versus imbalanced domains: ROC
• Certainty of the decision matters: B & K
• All the classes matter: RMSE
• The problem is binary but one class matters more
than the other: Precision, Recall, F-measure,
Sensitivity, Specificity, Likelihood Ratios.
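A minimal sketch computing RMSE on predicted probabilities alongside several of the other measures listed above; the dataset and model are illustrative assumptions:

```python
# A minimal sketch: look at one classifier from several metric angles.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (f1_score, mean_squared_error,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]   # positive-class probabilities
preds = clf.predict(X_te)

rmse = np.sqrt(mean_squared_error(y_te, probs))  # RMSE on probabilities
print(f"RMSE:      {rmse:.3f}")
print(f"Precision: {precision_score(y_te, preds):.3f}")
print(f"Recall:    {recall_score(y_te, preds):.3f}")
print(f"F-measure: {f1_score(y_te, preds):.3f}")
print(f"ROC AUC:   {roc_auc_score(y_te, probs):.3f}")
```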
Selecting an error estimation
method and statistical test I
If the size of the data set is large enough
(the size of all testing sets is at least 30)
and if the statistic of interest to the user is
parameterizable: cross-validation can be
tried (but see the next slide).
If the data set is particularly small, i.e., if
some of the testing sets contain fewer than
30 or so examples: Bootstrapping or
Randomization.
If the statistic of interest does not have
statistical tests associated with it:
Bootstrapping or Randomization.
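A minimal sketch of the cross-validation route: score two classifiers on the same 10 folds, then compare the per-fold results with a paired t-test. Dataset and models are illustrative assumptions:

```python
# A minimal sketch: 10-fold CV on shared folds, then a paired t-test
# on the per-fold accuracies of the two classifiers.
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
folds = KFold(n_splits=10, shuffle=True, random_state=0)

a = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=folds)
b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)

t, p = stats.ttest_rel(a, b)  # paired over the same 10 folds
print(f"t = {t:.2f}, p = {p:.3f}")
```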
Selecting an error estimation
method and statistical test II
Question: How can one see whether cross-validation
is appropriate for one's purposes?
Two ways:
• Visual: plot the distribution and check its shape
visually
• Apply a hypothesis test designed to see whether
the distribution is normal or not (e.g., Chi-squared
goodness of fit, Kolmogorov-Smirnov goodness
of fit, etc.)
Since no practical distribution will be exactly normal,
we must also look into the robustness of the various
statistical methods considered. The t-test is quite
robust against violations of the normality assumption.
If the distribution is far from normal, non-parametric
tests must be used.
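A minimal sketch of both checks on a placeholder sample of per-fold scores: a crude visual look at the distribution, then a Kolmogorov-Smirnov goodness-of-fit test against a normal fitted to the sample (note that estimating the parameters from the same sample makes the test somewhat liberal):

```python
# A minimal sketch: visual check plus a KS goodness-of-fit test.
# `scores` stands in for per-fold or per-run performance figures.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.normal(loc=0.85, scale=0.03, size=30)  # placeholder data

# Visual check: a crude text histogram (or use matplotlib's plt.hist).
counts, edges = np.histogram(scores, bins=6)
for c, lo in zip(counts, edges):
    print(f"{lo:.3f} | " + "#" * c)

# Hypothesis test: KS against a normal with the sample's mean and sd.
stat, p = stats.kstest(scores, "norm",
                       args=(scores.mean(), scores.std(ddof=1)))
print(f"KS statistic = {stat:.3f}, p = {p:.3f}")
```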
Selecting an error estimation
method and statistical test III
The robustness of a procedure is important since
that will ensure that the reported significance level
is close to the true one.
However, robustness does not answer the question
of whether efficient use is made of the data so that
a false null hypothesis can be rejected.
Power should also be considered.
The power of a test depends on intrinsic
properties of that test, but also on the shape and size
of the population to which it is applied.
Example: Parametric tests based on the normality
assumption are generally as powerful or more
powerful than non-parametric tests based on ranks
in the case of distribution functions with lighter
tails than the normal distribution.
Selecting an error estimation
method and statistical test IV
But: parametric tests based on the normality
assumption are less powerful than non-parametric
ones when the tails of the distribution are heavier
than those of the normal distribution. (An important
kind of data presenting such distributions is data
containing outliers.)
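A small simulation sketch of this claim, under illustrative settings: paired differences drawn from a heavy-tailed Student t distribution (2 degrees of freedom) with a true shift, tested with both a t-test and the rank-based Wilcoxon signed-rank test:

```python
# A minimal sketch: with heavy-tailed data, the rank-based Wilcoxon
# test tends to detect a true shift more often than the t-test.
# Shift size, sample size, and trial count are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
shift, n, trials, alpha = 0.5, 30, 2000, 0.05
t_hits = w_hits = 0

for _ in range(trials):
    # Heavy-tailed paired differences with a true positive shift.
    diffs = stats.t.rvs(df=2, size=n, random_state=rng) + shift
    if stats.ttest_1samp(diffs, 0.0).pvalue < alpha:
        t_hits += 1
    if stats.wilcoxon(diffs).pvalue < alpha:
        w_hits += 1

print(f"t-test power:   {t_hits / trials:.2f}")
print(f"Wilcoxon power: {w_hits / trials:.2f}")
```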
Note that the relative power of parametric and
non-parametric tests does not change as a
function of sample size, even if a test is
asymptotically distribution free (i.e., if it
becomes more and more robust as the
sample size increases).