Issues with Data Mining

Data Mining Involves Generalization
• Data mining (machine learning) learns generalizations of the instances in the training data
– E.g. a decision tree learnt from the weather data captures generalizations about the prediction of values for the Play attribute
– This means that generalizations predict (or describe) the behaviour of instances beyond the training data
– This in turn means that data mining extracts knowledge from raw data
• This knowledge drives the end user's decision-making process

Generalization as Search
• The process of generalization can be viewed as searching a space of all possible patterns or models
– For a pattern that fits the data
• This view provides a standard framework for understanding all data mining techniques
• E.g. decision tree learning involves searching through all possible decision trees
– Lecture 4 shows two example decision trees that fit the weather data
– One of them (Example 2) is a better generalization than the other

Bias
• Important choices made in a data mining system are
– Representation language – the language chosen to represent the patterns or models
– Search method – the order in which the space is searched
– Model pruning method – the way overfitting to the training data is avoided
• This means that each data mining scheme involves
– Language bias
– Search bias
– Overfitting-avoidance bias

Language Bias
• Different languages are used for representing patterns and models
– E.g. rules and decision trees
• A concept fits a subset of the training data
– That subset can be described as a disjunction of rules
– E.g. a classifier for the weather data can be represented as a disjunction of rules (see the sketch below)
• Languages differ in their ability to represent patterns and models
– This means that when a language with weaker representational ability is used, the data mining system may not achieve good performance
• Domain knowledge (external to the training data) helps to cut down the search space
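
To make the idea of a disjunction of rules concrete, here is a minimal Python sketch (not from the slides) of a rule-set classifier for the classic weather data; the exact rules are illustrative, assuming the usual attribute names:

    # Each "if" is one rule; the classifier is their disjunction.
    def play(outlook, humidity, windy):
        if outlook == "overcast":
            return "yes"
        if outlook == "sunny" and humidity == "normal":
            return "yes"
        if outlook == "rainy" and not windy:
            return "yes"
        return "no"

    print(play("sunny", "high", False))  # -> no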

Search Bias
• An exhaustive search over the search space is computationally expensive
• Search is sped up by using heuristics
– Pure child nodes indicate good tree stumps in decision tree learning
• By definition, heuristics cannot guarantee optimal patterns or models
– Using information gain may mislead us into selecting a suboptimal attribute at the root (see the sketch below)
• More complex search strategies are possible
– Those that pursue several alternatives in parallel
– Those that allow backtracking
• A high-level search bias
– General-to-specific: start with a root node and grow the decision tree to fit the specific data
– Specific-to-general: choose specific examples in each class and then generalize the class by including k-nearest-neighbour examples
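
As a concrete illustration of the information-gain heuristic, here is a minimal Python sketch (not from the slides); the 9-yes/5-no figures mirror the weather data:

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of a list of class labels
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, subsets):
        # Entropy before the split minus the weighted entropy after it
        n = len(labels)
        remainder = sum(len(s) / n * entropy(s) for s in subsets)
        return entropy(labels) - remainder

    before = ["yes"] * 9 + ["no"] * 5   # class distribution at the root
    after = [["yes"] * 2 + ["no"] * 3,  # one branch per attribute value
             ["yes"] * 4,
             ["yes"] * 3 + ["no"] * 2]
    print(information_gain(before, after))  # ~0.247 for this split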

Overfitting-Avoidance Bias
• We want to search for the 'best' patterns and models
• Other things being equal, simpler models are preferred
• Two strategies
– Start with the simplest model and stop building when the model starts to become complex
– Start with a complex model and prune it to make it simpler
• Each strategy biases the search in a different way
• Biases are unavoidable in practice
– Each data mining scheme involves a particular configuration of biases
– These biases may serve some problems well
• There is no universally best learning scheme!
– We saw this in our practicals with Weka

Combining Multiple Models
• Because there is no ideal data mining scheme, it is useful to combine multiple models
– The idea of democracy – decisions made on the basis of collective wisdom
– Each data mining scheme acts like an expert, using its knowledge to make decisions
• Three general approaches
– Bagging
– Boosting
– Stacking
• Bagging and boosting follow the same broad approach
– Take a vote on the class predictions from all the different models
– Bagging uses a simple average of votes, while boosting uses a weighted average (see the sketch below)
– Boosting gives more weight to more knowledgeable experts
• Boosting is generally considered the most effective
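
A minimal Python sketch (not from the slides) contrasting the two voting schemes; the weights are purely illustrative:

    from collections import defaultdict

    def vote(predictions, weights=None):
        # Simple majority vote (bagging-style) when weights is None,
        # weighted vote (boosting-style) otherwise.
        if weights is None:
            weights = [1.0] * len(predictions)
        totals = defaultdict(float)
        for label, w in zip(predictions, weights):
            totals[label] += w
        return max(totals, key=totals.get)

    preds = ["yes", "no", "no"]
    print(vote(preds))                    # majority vote -> no
    print(vote(preds, [0.9, 0.2, 0.3]))  # weighted vote -> yes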

Bias-Variance Decomposition
• Assume
– Infinitely many training data sets of the same size, n
– An infinite number of classifiers trained on the above data sets
• For any learning scheme
– Bias = the expected error of the classifier, even after increasing the training data infinitely
– Variance = the expected error due to the particular training set used
• Total expected error = bias + variance
• Combining multiple classifiers decreases the expected error by reducing the variance component (see the simulation sketch below)
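
A minimal simulation sketch (not from the slides) of the decomposition: fitting straight lines to a quadratic target keeps the bias but, because each training set gives a different line, also produces variance, which averaging removes. The whole setup is an illustrative assumption:

    import random

    def f(x):
        return x * x  # true target; a line cannot represent it (bias)

    def fit_line(sample):
        # Least-squares line through (x, y) pairs
        n = len(sample)
        mx = sum(x for x, _ in sample) / n
        my = sum(y for _, y in sample) / n
        b = sum((x - mx) * (y - my) for x, y in sample) / sum((x - mx) ** 2 for x, _ in sample)
        a = my - b * mx
        return lambda x: a + b * x

    def training_set(n):
        data = []
        for _ in range(n):
            x = random.uniform(-1, 1)
            data.append((x, f(x) + random.gauss(0, 0.3)))
        return data

    random.seed(0)
    models = [fit_line(training_set(30)) for _ in range(200)]
    preds = [m(0.8) for m in models]
    mean = sum(preds) / len(preds)
    variance = sum((p - mean) ** 2 for p in preds) / len(preds)
    print("bias^2 =", round((mean - f(0.8)) ** 2, 3), "variance =", round(variance, 3))
    # Averaging k such models divides the variance term by about k;
    # the bias term does not change.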

Bagging
• Bagging stands for bootstrap aggregating
– Combines equally weighted predictions from multiple models
• Bagging exploits instability in learning schemes
– Instability – a small change in the training data results in a big change in the model
• Idealized version for a classifier
– Collect several independent training sets
– Build a classifier from each training set
• E.g. learn a decision tree from each training set
– The class of a test instance is the prediction that received the most votes from all the classifiers
• In practice it is not feasible to obtain several independent training sets, so bagging resamples a single training set (next slide)

Bagging Algorithm
• Involves two stages
• Model generation
– Let n be the number of instances in the training data
– For each of t iterations:
• Sample n instances with replacement from training data
• Apply the learning algorithm to the sample
• Store the resulting model
• Classification
– For each of the t models:
• Predict class of instance using model
– Return class that has been predicted most often
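
The algorithm above translates almost line for line into code. A minimal sketch (not from the slides), using scikit-learn decision trees as the base learner; the library choice is an assumption:

    import random
    from collections import Counter
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    def bagging(X, y, t=25, seed=0):
        # Model generation: t bootstrap samples, one tree per sample
        rng = random.Random(seed)
        n = len(X)
        models = []
        for _ in range(t):
            idx = [rng.randrange(n) for _ in range(n)]  # n with replacement
            tree = DecisionTreeClassifier(random_state=0)
            tree.fit([X[i] for i in idx], [y[i] for i in idx])
            models.append(tree)
        return models

    def predict(models, x):
        # Classification: return the class predicted most often
        votes = Counter(int(m.predict([x])[0]) for m in models)
        return votes.most_common(1)[0][0]

    iris = load_iris()
    X, y = iris.data.tolist(), iris.target.tolist()
    models = bagging(X, y)
    print(predict(models, X[0]), y[0])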

Boosting
• Multiple data mining methods might complement each other
– Each method performs well on a subset of the data
• Boosting combines complementary models
– Using weighted voting
• Boosting is iterative
– Each new model is built to overcome the deficiencies of the earlier models
• There are several variants of boosting
– AdaBoost.M1 – based on the idea of giving weights to instances
• Boosting involves two stages
– Model generation
– Classification

Boosting
• Model generation
– Assign equal weight to each training instance
– For each of t iterations:
• Apply the learning algorithm to the weighted dataset and store the resulting model
• Compute the error e of the model on the weighted dataset and store it
• If e = 0 or e >= 0.5:
– Terminate model generation
• For each instance in the dataset:
– If the instance was classified correctly by the model, multiply its weight by e/(1-e)
• Normalize the weights of all instances
• Classification
– Assign a weight of zero to all classes
– For each of the t (or fewer) models:
• Add -log(e/(1-e)) to the weight of the class predicted by the model
– Return the class with the highest weight
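
A minimal Python sketch (not from the slides) of the two stages above, using scikit-learn decision stumps as the weak learner; the base-learner and dataset choices are assumptions:

    import math
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_m1(X, y, t=10):
        n = len(X)
        w = np.full(n, 1.0 / n)          # equal initial weights
        models = []
        for _ in range(t):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            e = w[pred != y].sum()       # weighted error
            if e == 0 or e >= 0.5:       # termination test from the slides
                break
            models.append((stump, e))
            w[pred == y] *= e / (1 - e)  # down-weight correct instances
            w /= w.sum()                 # normalize
        return models

    def classify(models, x):
        scores = {}
        for model, e in models:
            c = model.predict([x])[0]
            scores[c] = scores.get(c, 0.0) + -math.log(e / (1 - e))
        return max(scores, key=scores.get)

    data = load_breast_cancer()
    models = adaboost_m1(data.data, data.target)
    print(classify(models, data.data[0]), data.target[0])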

Stacking
• Bagging and boosting combine models of the same type
– E.g. a set of decision trees
• Stacking is applied to models of different types
– Voting may not work well when the different models do not perform comparably well
– E.g. voting is problematic when two out of three classifiers perform poorly
• Stacking uses a metalearner to combine the different base learners
– Base learners: level-0 models
– Metalearner: level-1 model
– The predictions of the base learners are fed as inputs to the metalearner
• Base learner predictions on the training data cannot be used as input to the metalearner
– They would be overly optimistic; instead, use the base learners' cross-validation predictions
• Because classification is done by the base learners, metalearners use simple learning schemes
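
A minimal Python sketch (not from the slides) of level-0/level-1 stacking with cross-validated predictions; the particular base and meta learners are assumptions:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Level-0: base learners of different types
    base = [DecisionTreeClassifier(random_state=0), GaussianNB()]

    # Build the metalearner's training set from cross-validation
    # predictions, not from predictions on the training data itself.
    meta_features = np.column_stack(
        [cross_val_predict(m, X, y, cv=5) for m in base]
    )

    # Level-1: a simple scheme, since the base learners do the real work
    meta = LogisticRegression(max_iter=1000)
    meta.fit(meta_features, y)

    # To classify a new instance: refit the base learners on all the
    # data, then feed their predictions to the metalearner.
    for m in base:
        m.fit(X, y)
    level0 = np.column_stack([m.predict(X[:1]) for m in base])
    print(meta.predict(level0))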

Combining Models Using Weka
• Weka offers methods to perform bagging, boosting and stacking over classifiers
• In the Explorer, under the Classify tab, expand the 'meta' section of the hierarchical menu
• AdaBoostM1 (one of the boosting methods) misclassifies only 7 out of 150 instances on the Iris data
• You are encouraged to try these methods on your own
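
The slides use the Weka Explorer GUI; as a rough analogue (an assumption, not the course setup), scikit-learn's AdaBoostClassifier can be run on Iris with cross-validation. The misclassification count depends on the implementation and settings, so it need not match the 7 out of 150 reported for Weka's AdaBoostM1:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_predict

    X, y = load_iris(return_X_y=True)
    pred = cross_val_predict(AdaBoostClassifier(random_state=0), X, y, cv=10)
    print((pred != y).sum(), "of", len(y), "instances misclassified")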