Market Basket Analysis/Machine Learning Concepts
COT5230 Data Mining
Week 2
Data Mining Tasks
Monash University
Lecture Outline
Market Basket Analysis
Machine Learning - Basic Concepts
Data Mining Tasks 1
Various taxonomies exist. Berry & Linoff define six tasks:
– Classification
– Estimation
– Prediction
– Affinity Grouping
– Clustering
– Description
Data Mining Tasks 2
The tasks are also referred to as operations. Cabena et al. define four operations:
– Predictive Modeling
– Database Segmentation
– Link Analysis
– Deviation Detection
Affinity Grouping
Affinity grouping is also referred to as Market
Basket Analysis
A common example is the discovery of which
items are frequently sold together at a
supermarket. If this is known, decisions can be
made about:
– arranging items on shelves
– which items should be promoted together
– which items should not simultaneously be discounted
Market Basket Analysis
When a customer buys a shirt, in 70% of cases, he or she will also buy a tie!
We find this happens in 13.5% of all purchases.
– Rule body (the condition): the customer buys a shirt
– Rule head (the result): he or she also buys a tie
– Confidence: 70% of the purchases containing a shirt also contain a tie
– Support: 13.5% of all purchases contain both a shirt and a tie
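A minimal sketch of how support and confidence come straight from counts. The transaction counts below are hypothetical, chosen only so the ratios match the shirt/tie example:

```python
# Hypothetical counts chosen so the ratios match the shirt/tie example.
total_transactions = 1000         # all purchases
shirt_transactions = 193          # purchases containing a shirt
shirt_and_tie_transactions = 135  # purchases containing both a shirt and a tie

# Confidence: of the purchases containing the condition (shirt),
# what fraction also contain the result (tie)?
confidence = shirt_and_tie_transactions / shirt_transactions   # ~0.70

# Support: what fraction of all purchases contain both items?
support = shirt_and_tie_transactions / total_transactions      # 0.135

print(f"confidence = {confidence:.2f}, support = {support:.3f}")
```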
The Usefulness of Market Basket Analysis
Some rules are useful: Unknown, unexpected and
indicative of some action to take.
Some rules are trivial: Known by anyone familiar
with the business.
Some rules are inexplicable: Seem to have no
explanation and do not suggest a course of
action.
“The key to success in business is to know
something that nobody else knows”
Aristotle Onassis
Co-Occurrence Table
Customer   Items
1          orange juice (OJ), cola
2          milk, orange juice, window cleaner
3          orange juice, detergent
4          orange juice, detergent, cola
5          window cleaner, cola

Co-occurrence counts (number of transactions containing both items):

            OJ   Cleaner   Milk   Cola   Detergent
OJ           4      1        1      2        2
Cleaner      1      2        1      1        0
Milk         1      1        1      0        0
Cola         2      1        0      3        1
Detergent    2      0        0      1        2
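A minimal sketch of how such a co-occurrence matrix can be tallied from the five transactions above (plain Python, no particular library assumed):

```python
from itertools import combinations_with_replacement

# The five transactions from the slide.
transactions = [
    {"OJ", "cola"},
    {"milk", "OJ", "cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "cola"},
    {"cleaner", "cola"},
]
items = ["OJ", "cleaner", "milk", "cola", "detergent"]

# counts[a][b] = number of transactions containing both a and b
# (the diagonal counts[a][a] is simply the number of transactions containing a).
counts = {a: {b: 0 for b in items} for a in items}
for basket in transactions:
    for a, b in combinations_with_replacement(items, 2):
        if a in basket and b in basket:
            counts[a][b] += 1
            if a != b:
                counts[b][a] += 1

print(counts["OJ"]["cola"])    # 2 transactions contain both OJ and cola
print(counts["cola"]["cola"])  # 3 transactions contain cola
```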
The Process for Market Basket Analysis
A co-occurrence cube would show associations in
three dimensions - hard to visualize more
We must:
– Choose the right set of items
– Generate rules by deciphering the counts in the co-occurrence matrix
– Overcome the practical limits imposed by many items
in large numbers of transactions
Choosing the Right Set of Items
Choosing the right level of detail (the creation of
classes and a taxonomy)
Virtual items may be added to take advantage of
information that goes beyond the taxonomy
Anonymous versus signed transactions
What is a Rule?
If condition then result
Note:
If nappies and Thursday then beer
is usually better than (in the sense that it is more actionable)
If Thursday then nappies and beer
because it has just one item in the result
If a 3-way combination is the most common, then consider
rules with just one item in the result, e.g.
If A and B, then C
If A and C, then B
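A minimal sketch of generating the single-item-result rules for a given frequent combination (plain Python; the itemset is the A, B, C example above):

```python
def single_result_rules(itemset):
    """For a frequent combination, yield rules with exactly one item in the result."""
    for result in itemset:
        condition = [i for i in itemset if i != result]
        yield condition, result

for condition, result in single_result_rules(["A", "B", "C"]):
    print(f"If {' and '.join(condition)} then {result}")
# If B and C then A
# If A and C then B
# If A and B then C
```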
Is the Rule a Useful Predictor? - 1
Confidence is the ratio of the number of
transactions with all the items in the rule to the
number of transactions with just the items in the
condition. Consider:
if B and C then A
If this rule has a confidence of 0.33, it means that
when B and C occur in a transaction, there is a
33% chance that A also occurs.
Is the Rule a Useful Predictor? - 2
Consider the following table of probabilities of items and their combinations:

Combination      Probability
A                0.45
B                0.42
C                0.40
A and B          0.25
A and C          0.20
B and C          0.15
A and B and C    0.05
Is the Rule a Useful Predictor? - 3
Now consider the following rules:
Rule                  p(condition)   p(condition and result)   confidence
If A and B then C         0.25               0.05                 0.20
If A and C then B         0.20               0.05                 0.25
If B and C then A         0.15               0.05                 0.33
It is tempting to choose “If B and C then A”,
because it is the most confident (33%) - but there
is a problem
Is the Rule a Useful Predictor? - 4
This rule is actually worse than just saying that A
randomly occurs in the transaction - which
happens 45% of the time
A measure called improvement indicates whether
the rule predicts the result better than just
assuming the result in the first place
improvement = p(condition and result) / ( p(condition) × p(result) )
Is the Rule a Useful Predictor? - 5
Improvement measures how much better a rule is
at predicting a result than just assuming the result
in the first place
When improvement > 1, the rule is better at
predicting the result than random chance
Is the Rule a Useful Predictor? - 6
Consider the improvement for our rules:
Rule                  support   confidence   improvement
If A and B then C       0.05       0.20          0.50
If A and C then B       0.05       0.25          0.59
If B and C then A       0.05       0.33          0.74
If A then B             0.25       0.56          1.31
None of the rules with three items shows any
improvement - the best rule in the data actually
has only two items: “if A then B”. A predicts the
occurrence of B 1.31 times better than chance.
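A minimal sketch that reproduces the confidence and improvement figures above from the probability table given earlier (plain Python; the slide rounds a couple of the values slightly differently):

```python
# Probabilities of the item combinations, as given in the earlier table.
p = {
    frozenset("A"): 0.45, frozenset("B"): 0.42, frozenset("C"): 0.40,
    frozenset("AB"): 0.25, frozenset("AC"): 0.20, frozenset("BC"): 0.15,
    frozenset("ABC"): 0.05,
}

def rule_stats(condition, result):
    """Return (support, confidence, improvement) for 'if condition then result'."""
    cond = frozenset(condition)
    both = cond | frozenset(result)
    support = p[both]
    confidence = support / p[cond]
    improvement = support / (p[cond] * p[frozenset(result)])
    return support, confidence, improvement

for cond, res in [("AB", "C"), ("AC", "B"), ("BC", "A"), ("A", "B")]:
    s, c, i = rule_stats(cond, res)
    print(f"If {' and '.join(cond)} then {res}: "
          f"support={s:.2f} confidence={c:.2f} improvement={i:.2f}")
# improvements: 0.50, 0.60, 0.74, 1.32
```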
Is the Rule a Useful Predictor? - 7
When improvement < 1, negating the result produces a better rule. For example
if B and C then not A
has a confidence of 0.67 (p(B and C and not A) = 0.15 − 0.05 = 0.10, and
0.10 / 0.15 = 0.67) and thus, since p(not A) = 1 − 0.45 = 0.55, an
improvement of 0.67 / 0.55 = 1.22
Negated rules may not be as useful as the
original association rules when it comes to acting
on the results
Strengths and Weaknesses
Strengths
– Clear understandable results
– Supports undirected data mining
– Works on variable length data
– Is simple to understand
Weaknesses
– Requires exponentially more computational effort as
the problem size grows
– Suits items in transactions but not all problems fit this
description
– It can be difficult to determine the right set of items to
analyse
– It does not handle rare items well; simply considering
the level of support will exclude these items
Machine Learning
“A general law can never be verified by a finite
number of observations. It can, however, be
falsified by only one observation.”
Karl Popper
The patterns that machine learning algorithms
find can never be definitive theories
Any results discovered must be tested for
statistical significance
The Empirical Cycle
[Cycle diagram: observation → analysis → theory → prediction → back to observation]
Concept Learning - 1
Example: the concept of a wombat
– a learning algorithm could consider many animals and
be advised in each case whether it is a wombat or not.
From this a definition would be deduced.
The definition is
– complete if it recognizes all instances of a concept (in
this case a wombat).
– consistent if it does not classify any negative
examples as falling under the concept.
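A minimal sketch of the two properties, assuming a learned definition is represented as a boolean predicate over labelled examples (the animal records and candidate definition below are purely illustrative):

```python
def is_complete(definition, positives):
    """Complete: every positive example (every wombat) is recognized."""
    return all(definition(x) for x in positives)

def is_consistent(definition, negatives):
    """Consistent: no negative example (non-wombat) is classified as a wombat."""
    return not any(definition(x) for x in negatives)

# Illustrative animal records and a candidate definition of 'wombat'.
positives = [{"legs": 4, "marsupial": True, "burrows": True}]
negatives = [{"legs": 2, "marsupial": False, "burrows": False}]
definition = lambda animal: animal["marsupial"] and animal["burrows"]

print(is_complete(definition, positives), is_consistent(definition, negatives))  # True True
```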
Concept Learning - 2
An incomplete definition is too narrow and would
not recognize some wombats.
An inconsistent definition is too broad and would
classify some non-wombats as wombats.
A bad definition could be both inconsistent and
incomplete.
Hypothesis Characteristics - 1
Classification Accuracy
– 1 in a million wrong is better than 1 in 10 wrong.
Transparency
– A person is able to understand the hypothesis generated.
It is then much easier to take action
Hypothesis Characteristics - 2
Statistical Significance
– The hypothesis must perform better than the naïve
prediction. (Imagine if 80% of animals considered are
wombats and the theory is that all animals are
wombats then the theory is right 80% of the time! But
nothing has been learnt.)
Information Content
– We look for a rich hypothesis. The more information
contained (while still being transparent) the more
understanding is gained and the easier it is to
formulate an action plan.
Complexity of Search Space
Machine learning can be considered as a search
problem. We wish to find the correct hypothesis
from among many.
– If there are only a few hypotheses we could try them all
but if there are an infinite number we need a better
strategy.
– If we have a measure of the quality of a hypothesis we
can use that measure to select potentially good
hypotheses and, based on the selection, try to improve
them (hill-climbing search; a sketch follows this slide)
Consider the metaphor of the kangaroo in the
mist.
– This demonstrates that it is important to know the
complexity of the search space, and that some pattern
recognition problems are almost impossible to solve.
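A minimal illustration of hill-climbing search over hypotheses, assuming only that we can score a hypothesis and generate its neighbours; the quality function and neighbourhood below are toy stand-ins, not anything prescribed by the slides:

```python
import random

def hill_climb(start, quality, neighbours, max_steps=1000):
    """Repeatedly move to a better neighbouring hypothesis until none improves."""
    current = start
    for _ in range(max_steps):
        best = max(neighbours(current), key=quality, default=current)
        if quality(best) <= quality(current):
            return current          # local optimum reached
        current = best
    return current

# Toy example: hypotheses are integers, quality peaks at 42.
quality = lambda h: -abs(h - 42)
neighbours = lambda h: [h - 1, h + 1]
print(hill_climb(random.randint(0, 100), quality, neighbours))  # 42
```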
Learning as a Compression
We have learnt something if we have an
algorithm that creates a description of the data
that is shorter than the original data set
A knowledge representation is required that is
incrementally compressible and an algorithm that
can achieve that incremental compression
File-in → Algorithm → File-out
The file-in could be a relational table and the file-out a prediction or a suggested clustering
Types of Input Message (File-in)
Unstructured or random messages
Highly structured messages with patterns that are
easy to find
Highly structured messages that are difficult to
decipher
Partly structured messages
– Most data sets considered by data mining are in this
class. There are patterns to be found but the data sets
are not highly regular
Minimum Message Length Principle
The best theory to explain a set of data is the one
that minimizes the sum of the length, in bits, of
the description of the theory, plus the length of the
data when encoded with the help of the theory.
Original data set:
01100011001001101100011010101111100100110
Theory:
00110011000011
Data set coded with the theory:
110001100110000111
Put another way, if regularity is found in a data set
and the description of this regularity together with
the description of the exceptions is still shorter
than the original data set, then something of value
has been found.
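A minimal sketch of the comparison the principle asks for, using the bit strings above (the slide does not specify how the theory encodes the data, so only the description lengths are compared):

```python
original = "01100011001001101100011010101111100100110"
theory = "00110011000011"
coded_data = "110001100110000111"

# Total description length with the theory's help: theory + data coded with it.
total_with_theory = len(theory) + len(coded_data)

print(len(original))      # 41 bits for the raw data set
print(total_with_theory)  # 32 bits with the theory's help
# The combined description is shorter, so the theory has captured real regularity.
```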
Noise and Redundancy
The distortion or mutation of a message is the
number of bits that are corrupted
Making the message longer by including
redundant information can ensure that a message
is received correctly even in the presence of
noise
Some pattern recognition algorithms cope well
with the presence of noise, others do not
We could consider a database which lacks
integrity to contain a large amount of noise
Patterns may exist for a small percentage of the
data due solely to noise
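An illustrative sketch of the redundancy point, using a simple 3x repetition code (my own example, not anything prescribed by the slides): every bit is sent three times, and a majority vote recovers the message even if noise corrupts one copy of a bit.

```python
def encode(bits):
    """Add redundancy: repeat every bit three times."""
    return "".join(b * 3 for b in bits)

def decode(received):
    """Majority vote over each group of three received bits."""
    groups = [received[i:i + 3] for i in range(0, len(received), 3)]
    return "".join("1" if g.count("1") >= 2 else "0" for g in groups)

message = "1011"
sent = encode(message)               # "111000111111"
corrupted = "110000111111"           # noise flips the third bit
print(decode(corrupted) == message)  # True - the redundancy absorbs the noise
```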