Association Rules & Correlations
Basic concepts
Efficient and scalable frequent itemset mining methods:
Apriori, and improvements
FP-growth
Rule postmining: visualization and validation
Interesting association rules.
Rule Validations
Only a small subset of the derived rules is likely to be meaningful/useful.
A domain expert must validate the rules.
Useful tools:
Visualization
Correlation analysis
Visualization of Association Rules: Plane Graph
[Figure: plane graph of association rules]
Visualization of Association Rules (SGI/MineSet 3.0)
[Figure: rule visualization in SGI MineSet 3.0]
Pattern Evaluation
Association rule algorithms tend to produce too many rules:
many of them are uninteresting or redundant
confidence(A → B) = P(B|A) = P(A,B)/P(A)
Confidence alone is not a discriminative enough criterion
Beyond the original support & confidence:
Interestingness measures can be used to prune/rank the derived patterns
Application of Interestingness Measure
[Figure: interestingness measures used to rank and filter mined patterns]
Computing Interestingness Measure
Given a rule X → Y, the information needed to compute the rule's interestingness can be obtained from a contingency table.

Contingency table for X → Y:

           Y     ¬Y
  X       f11   f10   f1+
  ¬X      f01   f00   f0+
          f+1   f+0   |T|

f11: support count of X and Y
f10: support count of X and ¬Y
f01: support count of ¬X and Y
f00: support count of ¬X and ¬Y

These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.
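These definitions translate directly into code. A minimal sketch in Python (the function name rule_measures is mine, not from the slides):

    def rule_measures(f11, f10, f01, f00):
        # Counts for rule X -> Y: f11 = transactions with both X and Y,
        # f10 = X without Y, f01 = Y without X, f00 = neither.
        n = f11 + f10 + f01 + f00       # |T|: total transactions
        p_xy = f11 / n                  # P(X,Y): the rule's support
        p_x = (f11 + f10) / n           # P(X) = f1+ / |T|
        p_y = (f11 + f01) / n           # P(Y) = f+1 / |T|
        confidence = p_xy / p_x         # P(Y|X)
        lift = confidence / p_y         # interest: P(X,Y) / (P(X) P(Y))
        ps = p_xy - p_x * p_y           # Piatetsky-Shapiro
        return p_xy, confidence, lift, ps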
Drawback of Confidence
          Coffee   ¬Coffee
  Tea        15         5      20
  ¬Tea       75         5      80
             90        10     100

Association Rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75,
but P(Coffee) = 0.9 > 0.75.
Although the confidence is high, the rule is misleading:
P(Coffee|¬Tea) = 75/80 = 0.9375 >> 0.75
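Checking these numbers directly from the counts (a throwaway snippet; variable names are mine):

    f11, f10, f01, f00 = 15, 5, 75, 5   # (Tea, Coffee) table above
    conf = f11 / (f11 + f10)            # P(Coffee|Tea)     = 0.75
    base = (f11 + f01) / 100            # P(Coffee)         = 0.9
    alt  = f01 / (f01 + f00)            # P(Coffee|not Tea) = 0.9375

Non-tea drinkers are even more likely to drink coffee, so despite its 75% confidence the rule Tea → Coffee points the wrong way.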
Statistical-Based Measures
Measures that take into account statistical dependence.

Does X lift the probability of Y? I.e., the probability of Y given X over the probability of Y:

    Lift = P(Y|X) / P(Y)

This is the same as the interest factor:

    Interest: I = P(X,Y) / (P(X) P(Y))

I = 1: independence; I > 1: positive association; I < 1: negative association.

    PS = P(X,Y) - P(X) P(Y)        (PS: Piatetsky-Shapiro)

Many other measures exist.
Example: Lift/Interest
          Coffee   ¬Coffee
  Tea        15         5      20
  ¬Tea       75         5      80
             90        10     100

Association Rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9
Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
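The same numbers drop out of the rule_measures sketch above:

    support, confidence, lift, ps = rule_measures(f11=15, f10=5, f01=75, f00=5)
    # confidence = 0.75, lift = 0.75 / 0.9 = 0.833..., i.e. lift < 1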
Drawback of Lift & Interest
Statistical independence: if P(X,Y) = P(X) P(Y), then Lift = 1.

           Y     ¬Y
  X       10      0     10
  ¬X       0     90     90
          10     90    100

    Lift = 0.1 / ((0.1)(0.1)) = 10

           Y     ¬Y
  X       90      0     90
  ¬X       0     10     10
          90     10    100

    Lift = 0.9 / ((0.9)(0.9)) = 1.11

Lift favors infrequent items. Other criteria have been proposed: Gini, J-measure, etc.
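A quick check with the earlier sketch makes the bias concrete: both tables show a perfect association, yet the rare itemset gets a far higher lift.

    _, _, lift_rare, _   = rule_measures(f11=10, f10=0, f01=0, f00=90)  # 10.0
    _, _, lift_common, _ = rule_measures(f11=90, f10=0, f01=0, f00=10)  # 1.11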
There are lots of measures proposed in the literature.
Some measures are good for certain applications, but not for others.
What criteria should we use to determine whether a measure is good or bad?
What about Apriori-style support-based pruning? How does it affect these measures?
Association Rules & Correlations
Basic concepts
Efficient and scalable frequent itemset mining methods:
Apriori, and improvements
FP-growth
Rule derivation, visualization and validation
Multi-level Associations
Summary
Multiple-Level Association Rules
Items often form a hierarchy (e.g., food → {bread, milk}; bread → {wheat, white}; milk → {2%, skim}; brands such as Fraser and Sunset at the lowest level).
Items at the lower level are expected to have lower support.
Rules regarding itemsets at appropriate levels can be quite useful.
The transaction database can be encoded based on dimensions and levels.
We can explore shared multi-level mining.

TID  Items
T1   {111, 121, 211, 221}
T2   {111, 211, 222, 323}
T3   {112, 122, 221, 411}
T4   {111, 121}
T5   {111, 122, 211, 221, 413}
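A common way to realize this encoding (my reading of the three-digit codes; the slide does not spell out which digit maps to which node) is one digit per hierarchy level, so an ancestor's code is a prefix of all its descendants' codes:

    def is_descendant(item: str, ancestor: str) -> bool:
        # e.g. '1' (a level-1 category) -> '11' (one of its types)
        #   -> '111' (a specific brand): the prefix test walks the hierarchy
        return item.startswith(ancestor)

    t1 = ["111", "121", "211", "221"]      # transaction T1 from the table
    level1 = {code[:1] for code in t1}     # T1 rolled up to level 1: {'1', '2'}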
Mining Multi-Level Associations
A top-down, progressive deepening approach:
First find high-level strong rules:
milk → bread [20%, 60%].
Then find their lower-level “weaker” rules:
2% milk → wheat bread [6%, 50%].
Variations in mining multiple-level association rules:
Level-crossed association rules:
2% milk → Wonder wheat bread
Association rules with multiple, alternative hierarchies:
2% milk → Wonder bread
Multi-level Association: Uniform Support vs. Reduced Support
Uniform Support: the same minimum support for all levels
+ One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support.
– Lower-level items do not occur as frequently: if the support threshold is too high, we miss low-level associations; if too low, we generate too many high-level associations.
Reduced Support: reduced minimum support at lower levels
There are 4 search strategies:
Level-by-level independent
Level-cross filtering by k-itemset
Level-cross filtering by single item
Controlled level-cross filtering by single item
Uniform Support
Multi-level mining with uniform support
Level 1 (min_sup = 5%):   Milk [support = 10%]
Level 2 (min_sup = 5%):   2% Milk [support = 6%]   Skim Milk [support = 4%]

With a uniform 5% threshold, Skim Milk (4%) fails and is pruned.
Reduced Support
Multi-level mining with reduced support
Level 1 (min_sup = 5%):   Milk [support = 10%]
Level 2 (min_sup = 3%):   2% Milk [support = 6%]   Skim Milk [support = 4%]

With the threshold reduced to 3% at level 2, Skim Milk (4%) is now frequent.
Multi-level Association: Redundancy Filtering
Some rules may be redundant due to “ancestor” relationships between rules. Example:
milk → wheat bread [support = 8%, confidence = 70%]
Say that 2% milk is 25% of milk sales; then:
2% milk → wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to the “expected” value based on the rule's ancestor: here the expected support is 8% × 25% = 2%, exactly what we observe, so the second rule adds no information.
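A minimal sketch of that test (the function name and the tolerance are my assumptions):

    def is_redundant(child_sup, ancestor_sup, child_share, tol=0.05):
        # Expected support of the specialized rule: the ancestor's support
        # scaled by the child item's share of its parent item's sales.
        expected = ancestor_sup * child_share
        return abs(child_sup - expected) <= tol * expected

    # milk -> wheat bread has support 8%; 2% milk is 25% of milk sales:
    is_redundant(child_sup=0.02, ancestor_sup=0.08, child_share=0.25)  # True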
Multi-Level Mining: Progressive Deepening
A top-down, progressive deepening approach:
First mine high-level frequent items:
milk (15%), bread (10%)
Then mine their lower-level “weaker” frequent itemsets:
2% milk (5%), wheat bread (4%)
Different min_support thresholds across levels lead to different algorithms (see the sketch below):
If adopting the same min_support across all levels,
then toss an itemset t if any of t's ancestors is infrequent.
If adopting reduced min_support at lower levels,
then examine only those descendants whose ancestors are frequent/non-negligible.
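Both variants fit the same loop: descend a level only below items that survived the level above. A rough sketch, reusing the digit-prefix encoding from the TID table (helper names are mine):

    def support(code, transactions):
        # Fraction of transactions containing any item under this code.
        hits = sum(any(item.startswith(code) for item in t) for t in transactions)
        return hits / len(transactions)

    def progressive_deepening(transactions, levels, min_sups):
        # levels: lists of item codes, top level first; min_sups: matching
        # per-level thresholds (all equal for uniform, decreasing for reduced).
        frequent, parents = [], None
        for codes, min_sup in zip(levels, min_sups):
            if parents is not None:  # descend only below frequent ancestors
                codes = [c for c in codes if any(c.startswith(p) for p in parents)]
            parents = [c for c in codes if support(c, transactions) >= min_sup]
            frequent.extend(parents)
        return frequent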
Association Rules & Correlations
Basic concepts
Efficient and scalable frequent itemset mining methods:
Apriori, and improvements
FP-growth
Rule derivation, visualization and validation
Multi-level Associations
Temporal associations and frequent sequences [later]
Other association mining methods
Summary
Other Association Mining Methods
CHARM: Mining frequent itemsets by a Vertical Data Format
Mining Frequent Closed Patterns
Mining Max-patterns
Mining Quantitative Associations [e.g., what is the implication between age and income?]
Constraint-based (Query-Directed) Association Mining
Frequent Patterns in Data Streams: a very difficult problem; performance is a real issue
Mining Sequential and Structured Patterns
Summary
Association rule mining is probably the most significant contribution from the database community in KDD.
New interesting research directions:
Association analysis in other types of data: spatial data, multimedia data, time series data, etc.
Association Rule Mining for Data Streams: a very difficult challenge.
Statistical Independence
Population of 1000 students:
600 students know how to swim (S)
700 students know how to bike (B)
420 students know how to swim and bike (S,B)

P(S,B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42

P(S,B) = P(S) P(B) => statistical independence
P(S,B) > P(S) P(B) => positively correlated
P(S,B) < P(S) P(B) => negatively correlated