Rule Learning - Teaching-WIKI


Rule Learning
Intelligent Systems – Lecture 10
1) Motivation (what problem is solved)
2) Technical solution
• Definitions
• Explanations
• Procedures
• Illustrations by short examples
3) Illustration by a larger example
4) Extensions (Work only sketched)
Motivation
• Data warehouses and other large-scale data collections provide vast amounts of data and hide knowledge that is not obvious to discover.
• Rule learning (association rule learning) is a popular means for discovering interesting relations between items in large data sets.
• Rules enable the inference of knowledge and make hidden
facts explicit.
Motivating Example
• Association rule learning is very popular in marketing:
– People who buy Coke and beer are very likely to also buy potato chips: {Coke, beer} ⇒ {potato chips}.
– Retail Advertising and Marketing Association: "For
example, convenience stores realized that new fathers
came in to buy Pampers and they also bought beer. So
where are the Pampers now? Near the beer. It's about
getting a feel for what other things you can merchandise."
• Other popular application domains include, among others, Web usage mining, intrusion detection, and bioinformatics.
Definition by Agrawal
• Let I = {i1, i2, ..., in} be a set of binary attributes called items, and let D = {t1, t2, ..., tm} be a set of transactions, called the database.
• Each transaction (tuple in the database) contains a subset of
the items in I.
• A rule is then defined as an implication of the form A → C, where A, C ⊆ I and A ∩ C = ∅.
• The left-hand side (A) is called the antecedent, the right-hand side (C) is referred to as the consequent (see the sketch below).
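To make these definitions concrete, here is a minimal Python sketch; the item names and transactions are invented for illustration, and the variable names are arbitrary. A transaction database is simply a list of item sets, and a rule is a pair of disjoint itemsets.

# Sketch of the definitions above, with invented example data.
# Items are strings; the database D is a list of transactions (sets of items).
items = {"milk", "bread", "butter", "beer"}

database = [
    {"milk", "bread"},           # t1
    {"butter"},                  # t2
    {"beer", "bread"},           # t3
    {"milk", "bread", "butter"}, # t4
]

# A rule A -> C: antecedent and consequent are disjoint subsets of the items.
antecedent = frozenset({"milk", "bread"})
consequent = frozenset({"butter"})
assert antecedent <= items and consequent <= items
assert antecedent.isdisjoint(consequent)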
Example
• [Simple example from Wikipedia about milk, butter, bread and beer]
Significance of Rules
• Association rule learning only makes sense in the context of very large data sets.
• In very large data sets there are obviously hundreds, if not thousands, of discoverable implications.
• The significance of and interest in a rule are therefore important selection criteria, and only those rules that cover a sufficiently large share of the data can be considered relevant.
• The support of an itemset A is defined as the proportion of transactions ti ∈ D that contain A.
Confidence
• The confidence in a rule depends on the support of the itemset A and the support of the union of A and C.
• Confidence: conf(A → C) = supp(A ∪ C) / supp(A).
• The confidence is an estimate of the conditional probability P(C|A), i.e. the probability that the consequent holds given that the antecedent holds (support and confidence are computed in the sketch below).
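As a sketch of how support and confidence follow from these definitions (the toy database below is invented and the helper names are arbitrary):

# Sketch: computing support and confidence on an invented toy database.
database = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "bread"},
    {"milk", "bread", "butter"},
]

def support(itemset, db):
    # Proportion of transactions that contain every item of `itemset`.
    itemset = set(itemset)
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    # Estimate of P(C|A): supp(A ∪ C) / supp(A).
    return support(set(antecedent) | set(consequent), db) / support(antecedent, db)

print(support({"milk", "bread"}, database))                 # 0.5
print(confidence({"milk", "bread"}, {"butter"}, database))  # 0.5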
Finding Rules
• Association rules need to satisfy a given level of confidence and a given degree of support at the same time.
• A two-step process is generally applied to discover rules that satisfy both requirements:
– A minimum support threshold is used to determine the frequent itemsets.
– A minimum confidence threshold is applied to determine the rules (a sketch of this second step follows below).
• The first step is significantly more challenging than the second one!
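The second step can be sketched as follows, assuming the frequent itemsets and their supports are already available (finding them is the subject of the next slide); the support values below are illustrative only:

from itertools import combinations

# Sketch of step 2: deriving rules from already-computed frequent itemsets.
# `supports` maps frequent itemsets to their support (values invented).
supports = {
    frozenset({"beer"}): 0.4,
    frozenset({"chips"}): 0.5,
    frozenset({"beer", "chips"}): 0.35,
}

def rules_from_frequent_itemsets(supports, min_conf):
    rules = []
    for itemset, supp in supports.items():
        if len(itemset) < 2:
            continue
        # Try every non-empty proper subset as antecedent A; the rest is C.
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = supp / supports[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, consequent, conf))
    return rules

# Keeps {beer} -> {chips} (confidence 0.875); {chips} -> {beer} (0.7) is dropped.
print(rules_from_frequent_itemsets(supports, min_conf=0.8))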
Finding Frequent Itemsets
• The number of possible itemsets is given by the power set of I and is thus equal to 2^n − 1, with n = |I| (the empty set is excluded).
• Consequently, the number of potential itemsets grows exponentially in the size of I.
• Different algorithms nonetheless allow the frequent itemsets to be computed efficiently: Apriori (BFS), Eclat (DFS), FP-Growth (FP-tree); an Apriori-style sketch follows below.
• These algorithms exploit the downward-closure property (aka anti-monotonicity): every subset of a frequent itemset is frequent, which implies that every superset of an infrequent itemset is infrequent.
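The following is a simplified, unoptimized sketch of the Apriori idea: level-wise candidate generation pruned by the downward-closure property; the toy database is invented.

from itertools import combinations

def apriori(database, min_support):
    n = len(database)
    def supp(itemset):
        return sum(1 for t in database if itemset <= t) / n

    # Level 1: frequent single items.
    items = {i for t in database for i in t}
    current = {frozenset({i}) for i in items if supp(frozenset({i})) >= min_support}
    all_frequent = {}
    k = 1
    while current:
        all_frequent.update({s: supp(s) for s in current})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (downward closure): drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c for c in candidates if supp(c) >= min_support}
    return all_frequent

db = [{"milk", "bread"}, {"bread", "butter"}, {"milk", "bread", "butter"}, {"milk"}]
print(apriori(db, min_support=0.5))
# Frequent: {milk}, {bread}, {butter}, {milk, bread}, {bread, butter}, each with support >= 0.5.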
Alternative Indicators
• All-confidence: all rules that can be derived from an itemset A have at least a confidence of
all-confidence(A) = supp(A) / max(supp(a ∈ A))
with max(supp(a ∈ A)) the support of the item with the highest support in A (see the sketch below).
• Collective strength is given by
cs(A) = ((1 − v(A)) / (1 − E[v(A)])) · (E[v(A)] / v(A))
with v(A) the violation rate and E[·] the expected value under independence of the items; the violation rate is defined as the fraction of transactions which contain some, but not all, of the items in the itemset. Collective strength gives 0 for perfectly negatively correlated items, infinity for perfectly positively correlated items, and 1 if the items co-occur as expected under independence.
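A small sketch of the all-confidence measure as defined above, on an invented toy database:

# Sketch: all-confidence(A) = supp(A) / max support of any single item in A.
database = [
    {"coke", "beer", "chips"},
    {"coke", "chips"},
    {"beer"},
    {"coke", "beer"},
]

def support(itemset, db):
    return sum(1 for t in db if set(itemset) <= t) / len(db)

def all_confidence(itemset, db):
    return support(itemset, db) / max(support({a}, db) for a in itemset)

# supp({coke, beer}) = 0.5 and the highest single-item support is 0.75, so every
# rule derivable from {coke, beer} has confidence >= 0.5 / 0.75 ≈ 0.67.
print(all_confidence({"coke", "beer"}, database))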
Alternative Indicators (2)
• Coverage (aka antecedent support) measures how often a rule is applicable in a database:
coverage(A → C) = supp(A) = P(A)
• The conviction of a rule compares the expected frequency of the antecedent A appearing without C (if A and C were independent) with the observed frequency of A appearing without C:
conv(A → C) = (1 − supp(C)) / (1 − conf(A → C))
• Leverage measures the difference between how often A and C appear together in the data set and what would be expected if A and C were statistically independent (these measures are computed together with lift in the sketch below):
leverage(A → C) = P(A and C) − P(A)P(C)
Alternative Indicators (3)
• The lift of a rule is the ratio between the rule's confidence and the confidence that would be expected by pure chance; it thus measures how many times more often A and C occur together than expected if they were statistically independent (see the combined sketch below):
lift(A → C) = conf(A → C) / supp(C)
= supp(A ∪ C) / (supp(A) · supp(C)).
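The sketch below gathers coverage, conviction, leverage and lift in one place, following the formulas on the last two slides; the toy database is invented and the function names are arbitrary:

# Sketch: rule quality measures computed from supports over a toy database.
database = [
    {"coke", "beer", "chips"},
    {"coke", "chips"},
    {"beer", "chips"},
    {"coke", "beer"},
]

def supp(itemset):
    return sum(1 for t in database if set(itemset) <= t) / len(database)

def conf(A, C):
    return supp(set(A) | set(C)) / supp(A)

def coverage(A, C):
    return supp(A)                              # how often the rule is applicable

def conviction(A, C):
    return (1 - supp(C)) / (1 - conf(A, C))     # assumes conf < 1

def leverage(A, C):
    return supp(set(A) | set(C)) - supp(A) * supp(C)

def lift(A, C):
    return conf(A, C) / supp(C)                 # = supp(A ∪ C) / (supp(A) * supp(C))

A, C = {"coke", "beer"}, {"chips"}
print(coverage(A, C), conviction(A, C), leverage(A, C), lift(A, C))
# 0.5, 0.5, -0.125, 0.666...: here A and C co-occur less often than independence would predict.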
Decision Trees as Rules
• Rules can be represented by a decision tree that takes the
general form
if item1 then subtree1
elseif item2 then subtree2
elseif...
• A decision tree has as many leaf nodes as there are rules, and thus the tree represents a whole set of rules.
Decision-Driven Rules
• The following definitions apply to rules that aim to conclude a fact from a given set of attribute-value assignments.
• The decision tree thus takes the following form
if attribute1 = value1 then subtree1
elseif attribute1 = value2 then subtree2
elseif...
• The critical question is then: which attribute should be the first one to evaluate, i.e. which attribute is the most selective determiner and should therefore become the root of the decision tree?
Entropy
• Entropy is a measure of the 'degree of doubt' and is a well-studied concept in information theory.
• Let {c1, c2, ..., cn} be the set C of conclusions of a rule (consequences); let {x1, x2, ..., xm} be the set of possible values of an attribute X.
• The probability that ci is true given that X has the value xj is written p(ci|xj).
• The entropy of the conclusions, given that X = xj, is then defined as
entropy(C|X=xj) = − Σi p(ci|xj) log2 p(ci|xj), with the sum running over i = 1, ..., |C|
• The term − log2 p(ci|xj) indicates the 'amount of information' that xj has to offer about the conclusion ci.
Most Useful Determiner
• The lower the entropy of xj with respect to C, the more information xj has to offer about C.
• The entropy of an attribute X with respect to C is then given by
entropy(C|X) = − Σj p(xj) Σi p(ci|xj) log2 p(ci|xj)
• The attribute with the lowest entropy is the most useful determiner, as it has the lowest 'degree of doubt' (see the sketch below).
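A sketch of how the entropy of each attribute with respect to the conclusion can be computed, and the most useful determiner selected; the record format (dictionaries with a 'conclusion' key) and the example data are assumptions made for illustration:

from collections import Counter, defaultdict
from math import log2

def attribute_entropy(records, attribute):
    # entropy(C|X) = - sum_j p(x_j) * sum_i p(c_i|x_j) * log2 p(c_i|x_j)
    groups = defaultdict(list)      # attribute value -> conclusions observed with it
    for row in records:
        groups[row[attribute]].append(row["conclusion"])
    h = 0.0
    for conclusions in groups.values():
        p_x = len(conclusions) / len(records)
        counts = Counter(conclusions)
        h += p_x * -sum((c / len(conclusions)) * log2(c / len(conclusions))
                        for c in counts.values())
    return h

# Invented records: decide whether to go outside based on two weather attributes.
records = [
    {"outlook": "sunny",    "windy": "no",  "conclusion": "yes"},
    {"outlook": "sunny",    "windy": "yes", "conclusion": "yes"},
    {"outlook": "overcast", "windy": "no",  "conclusion": "yes"},
    {"outlook": "rainy",    "windy": "no",  "conclusion": "no"},
    {"outlook": "rainy",    "windy": "yes", "conclusion": "no"},
]

# The attribute with the lowest entropy is the most useful determiner.
best = min(["outlook", "windy"], key=lambda a: attribute_entropy(records, a))
print(best)  # 'outlook' (entropy 0.0 here, since it determines the conclusion completely)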
ID3 Algorithm by Quinlan
1. For each attribute, compute its entropy with respect to the
conclusion;
2. Select the attribute with the lowest entropy (say X).
3. Divide the data into separate sets so that within a set, X has a fixed
value (X=x1 in one set, X=x2 in another...)
4. Build a tree with branches:
if X=x1 then ... (subtree1)
if X=x2 then ... (subtree2)
...
5. For each subtree, repeat from step 1.
6. At each iteration, one attribute is removed from consideration. STOP if no more attributes are left, or if all attribute values lead to the same conclusion (a compact sketch of this procedure follows below).
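A compact recursive sketch of the ID3 procedure, reusing the same attribute-entropy helper as in the previous sketch; the simplifications (no pruning, no handling of unseen attribute values) are deliberate, and the data format is again an assumption:

from collections import Counter, defaultdict
from math import log2

def attribute_entropy(records, attribute):
    # Same helper as above: entropy of the conclusion given the attribute.
    groups = defaultdict(list)
    for row in records:
        groups[row[attribute]].append(row["conclusion"])
    h = 0.0
    for conclusions in groups.values():
        p_x = len(conclusions) / len(records)
        counts = Counter(conclusions)
        h += p_x * -sum((c / len(conclusions)) * log2(c / len(conclusions))
                        for c in counts.values())
    return h

def id3(records, attributes):
    conclusions = Counter(r["conclusion"] for r in records)
    if len(conclusions) == 1 or not attributes:
        # Pure set, or no attributes left: return a leaf with the majority conclusion.
        return conclusions.most_common(1)[0][0]
    best = min(attributes, key=lambda a: attribute_entropy(records, a))  # steps 1-2
    branches = {}
    for value in {r[best] for r in records}:                             # step 3
        subset = [r for r in records if r[best] == value]
        remaining = [a for a in attributes if a != best]                 # step 6
        branches[value] = id3(subset, remaining)                         # steps 4-5
    return (best, branches)

records = [
    {"outlook": "sunny", "windy": "no",  "conclusion": "yes"},
    {"outlook": "rainy", "windy": "no",  "conclusion": "no"},
    {"outlook": "rainy", "windy": "yes", "conclusion": "no"},
]
print(id3(records, ["outlook", "windy"]))
# e.g. ('outlook', {'sunny': 'yes', 'rainy': 'no'}): a tree with one test and two leaves.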
Example
• [Robinson Crusoe example: what is good to eat...]
Counter Example
• Consider the following data:
– X=3, Y=3 ⇒ yes
– X=2, Y=1 ⇒ no
– X=3, Y=4 ⇒ no
– X=1, Y=1 ⇒ yes
– X=2, Y=2 ⇒ yes
– X=3, Y=2 ⇒ no
• The entropy-based ID3 algorithm is incapable of spotting the obvious relationship if X = Y then yes else no, as only one attribute is considered at a time (the sketch below makes this concrete)!
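To make the limitation concrete, the sketch below computes the entropy of X and of Y with respect to the conclusion for exactly the six records above: both attributes leave considerable doubt, and no single-attribute test can express the condition X = Y.

from collections import Counter
from math import log2

# The six records of the counter example; the hidden target concept is "X == Y".
data = [(3, 3, "yes"), (2, 1, "no"), (3, 4, "no"),
        (1, 1, "yes"), (2, 2, "yes"), (3, 2, "no")]

def entropy_given(attribute_index):
    # Weighted entropy of the conclusion, grouped by the values of one attribute.
    h = 0.0
    for value in {row[attribute_index] for row in data}:
        conclusions = [row[2] for row in data if row[attribute_index] == value]
        counts = Counter(conclusions)
        h += (len(conclusions) / len(data)) * -sum(
            (c / len(conclusions)) * log2(c / len(conclusions)) for c in counts.values())
    return h

print(entropy_given(0))  # entropy with respect to X, roughly 0.79
print(entropy_given(1))  # entropy with respect to Y, roughly 0.67
# Neither attribute alone removes the doubt; the relational rule "if X = Y then yes"
# never appears as a single test, because each ID3 split considers one attribute in isolation.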