Basic Data Format - UCLA Computer Science


Association Rules & Correlations
• Basic concepts
• Efficient and scalable frequent itemset mining methods:
  • Apriori, and improvements
  • FP-growth
• Rule post-mining: visualization and validation
• Interesting association rules

Rule Validations
• Only a small subset of the derived rules might be meaningful/useful.
• A domain expert must validate the rules.
• Useful tools:
  • Visualization
  • Correlation analysis

Visualization of Association Rules: Plane Graph
(Figure: plane-graph visualization of association rules.)

Visualization of Association Rules (SGI/MineSet 3.0)
(Figure: rule visualization screenshot from SGI MineSet 3.0.)

Pattern Evaluation
• Association rule algorithms tend to produce too many rules:
  • many of them are uninteresting or redundant.
• confidence(A → B) = P(B|A) = P(A ∧ B) / P(A)   (computed in the sketch below)
• Confidence alone is not a sufficiently discriminative criterion.
• Beyond the original support & confidence: interestingness measures can be used to prune/rank the derived patterns.

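As a concrete illustration (a minimal sketch, not code from the lecture; the transactions and item names are invented), support and confidence for a rule A → B can be computed over a transaction list like this:

```python
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(A, B, transactions):
    """confidence(A -> B) = P(A & B) / P(A)."""
    return support(A | B, transactions) / support(A, transactions)

print(confidence({"milk"}, {"bread"}, transactions))  # 2/3 ~ 0.667
```
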
Application of Interestingness Measure
(Figure: where interestingness measures fit in the pattern mining pipeline.)

Computing Interestingness Measure
Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:

           Y       ¬Y
  X       f11     f10     f1+
  ¬X      f01     f00     f0+
          f+1     f+0     |T|

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc. (The sketch below shows how they are tallied.)

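A small sketch of how the four counts can be derived from a transaction list; the helper name `contingency` and the toy data are assumptions for illustration, not lecture code:

```python
def contingency(X, Y, transactions):
    """Counts f11, f10, f01, f00 for the 2x2 table of rule X -> Y."""
    f11 = f10 = f01 = f00 = 0
    for t in transactions:
        has_x, has_y = X <= t, Y <= t
        if has_x and has_y:
            f11 += 1
        elif has_x:
            f10 += 1
        elif has_y:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00   # row/column sums follow by addition

transactions = [{"tea", "coffee"}, {"coffee"}, {"tea"}, {"coffee", "milk"}]
print(contingency({"tea"}, {"coffee"}, transactions))  # (1, 1, 2, 0)
```
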
Drawback of Confidence

            Coffee   ¬Coffee
  Tea         15        5        20
  ¬Tea        75        5        80
              90       10       100

Association rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75,
but P(Coffee) = 0.9 > 0.75.
Although the confidence is high, the rule is misleading:
P(Coffee|¬Tea) = 75/80 = 0.9375 >> 0.75.

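Plugging the table's counts into code confirms the numbers (a quick check, not lecture code):

```python
# Counts from the table above:
# (tea,coffee), (tea,no coffee), (no tea,coffee), (no tea,no coffee)
f11, f10, f01, f00 = 15, 5, 75, 5
n = f11 + f10 + f01 + f00                 # 100 transactions
conf = f11 / (f11 + f10)                  # P(Coffee|Tea)  = 0.75
base = (f11 + f01) / n                    # P(Coffee)      = 0.90
alt  = f01 / (f01 + f00)                  # P(Coffee|~Tea) = 0.9375
print(conf, base, alt)  # tea drinkers are *less* likely to buy coffee
```
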
Statistical-Based Measures
Measures that take statistical dependence into account. Does X lift the probability of Y? I.e., the probability of Y given X over the probability of Y:

  Lift = P(Y|X) / P(Y)

This is the same as the interest factor:

  I = P(X,Y) / (P(X) P(Y))

I = 1: independence; I > 1: positive association; I < 1: negative association.

Another measure:

  PS = P(X,Y) - P(X) P(Y)      (PS: Piatetsky-Shapiro)

Many other measures exist. (All three are computed in the sketch below.)

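The three measures, written over contingency counts rather than probabilities (a sketch; the function name and layout are mine, not the lecture's):

```python
def measures(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    p_xy = f11 / n                     # P(X,Y)
    p_x = (f11 + f10) / n              # P(X)
    p_y = (f11 + f01) / n              # P(Y)
    lift = (p_xy / p_x) / p_y          # P(Y|X) / P(Y)
    interest = p_xy / (p_x * p_y)      # algebraically identical to lift
    ps = p_xy - p_x * p_y              # Piatetsky-Shapiro
    return lift, interest, ps

# Tea -> Coffee table from the previous slide:
print(measures(15, 5, 75, 5))   # lift = interest ~ 0.833, PS = -0.03
```
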
Example: Lift/Interest

            Coffee   ¬Coffee
  Tea         15        5        20
  ¬Tea        75        5        80
              90       10       100

Association rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9.
Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated).

Drawback of Lift & Interest
Statistical independence: if P(X,Y) = P(X) P(Y), then Lift = 1.

          Y     ¬Y
  X      10      0     10
  ¬X      0     90     90
         10     90    100

  Lift = 0.1 / ((0.1)(0.1)) = 10

          Y     ¬Y
  X      90      0     90
  ¬X      0     10     10
         90     10    100

  Lift = 0.9 / ((0.9)(0.9)) ≈ 1.11

So lift favors infrequent items. Other criteria have been proposed: Gini, J-measure, etc. (Both lift values are reproduced in the sketch below.)

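The two tables above, run through a lift computation (a sketch, using the same contingency-count convention as before):

```python
def lift(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return (f11 / n) / (((f11 + f10) / n) * ((f11 + f01) / n))

print(lift(10, 0, 0, 90))   # 10.0  -- rare but perfectly associated pair
print(lift(90, 0, 0, 10))   # ~1.11 -- common, equally perfectly associated
```
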
There are lots of measures proposed in the literature. Some measures are good for certain applications, but not for others. What criteria should we use to determine whether a measure is good or bad? What about Apriori-style support-based pruning? How does it affect these measures?

Association Rules & Correlations
• Basic concepts
• Efficient and scalable frequent itemset mining methods:
  • Apriori, and improvements
  • FP-growth
• Rule derivation, visualization and validation
• Multi-level Associations
• Summary

Multiple-Level Association Rules
• Items often form a hierarchy. (Figure: food at the top; milk and bread below; 2% and skim under milk, wheat and white under bread; brands such as Fraser and Sunset at the lowest level.)
• Items at the lower level are expected to have lower support.
• Rules regarding itemsets at appropriate levels could be quite useful.
• The transaction database can be encoded based on dimensions and levels.
• We can explore shared multi-level mining (see the sketch after the table below).

TID   Items
T1    {111, 121, 211, 221}
T2    {111, 211, 222, 323}
T3    {112, 122, 221, 411}
T4    {111, 121}
T5    {111, 122, 211, 221, 413}

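A sketch of level-wise support counting over the encoded table. One assumption, consistent with the three-digit codes shown but not spelled out on the slide: each digit encodes one level of the hierarchy, so the level-1 ancestor of item 111 is the prefix "1" and its level-2 ancestor is "11":

```python
db = {
    "T1": {"111", "121", "211", "221"},
    "T2": {"111", "211", "222", "323"},
    "T3": {"112", "122", "221", "411"},
    "T4": {"111", "121"},
    "T5": {"111", "122", "211", "221", "413"},
}

def level_support(prefix, db):
    """Fraction of transactions containing any item under `prefix`.

    Assumed encoding: one digit per hierarchy level, ancestors = prefixes.
    """
    hits = sum(1 for items in db.values()
               if any(i.startswith(prefix) for i in items))
    return hits / len(db)

print(level_support("1", db))    # level-1 category "1":  5/5 = 1.0
print(level_support("22", db))   # level-2 group "22":    4/5 = 0.8
```
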
Mining Multi-Level Associations
• A top-down, progressive deepening approach:
  • First find high-level strong rules: milk → bread [20%, 60%].
  • Then find their lower-level "weaker" rules: 2% milk → wheat bread [6%, 50%].
• Variations of mining multiple-level association rules:
  • Level-crossed association rules: 2% milk → Wonder wheat bread
  • Association rules with multiple, alternative hierarchies: 2% milk → Wonder bread

Multi-level Association: Uniform Support vs. Reduced Support
• Uniform support: the same minimum support for all levels
  + One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support.
  - Lower-level items do not occur as frequently. If the support threshold is
    • too high → we miss low-level associations
    • too low → we generate too many high-level associations
• Reduced support: reduced minimum support at lower levels
• There are 4 search strategies:
  • Level-by-level independent
  • Level-cross filtering by k-itemset
  • Level-cross filtering by single item
  • Controlled level-cross filtering by single item

Uniform Support
Multi-level mining with uniform support:

  Level 1 (min_sup = 5%):  Milk [support = 10%]
  Level 2 (min_sup = 5%):  2% Milk [support = 6%], Skim Milk [support = 4%]

Reduced Support
Multi-level mining with reduced support:

  Level 1 (min_sup = 5%):  Milk [support = 10%]
  Level 2 (min_sup = 3%):  2% Milk [support = 6%], Skim Milk [support = 4%]

The two schemes are contrasted in the sketch below.

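A quick sketch of the contrast, with the thresholds and supports taken from the two slides above:

```python
# item -> (hierarchy level, support)
supports = {"Milk": (1, 0.10), "2% Milk": (2, 0.06), "Skim Milk": (2, 0.04)}
uniform = {1: 0.05, 2: 0.05}   # same min_sup at both levels
reduced = {1: 0.05, 2: 0.03}   # lower min_sup at level 2

for name, (level, s) in supports.items():
    print(name,
          "| uniform:", "keep" if s >= uniform[level] else "prune",
          "| reduced:", "keep" if s >= reduced[level] else "prune")
# Skim Milk (4%) is pruned under uniform support but kept under reduced.
```
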
Multi-level Association: Redundancy Filtering
• Some rules may be redundant due to "ancestor" relationships between rules. Example:
  milk → wheat bread [support = 8%, confidence = 70%]
• Say that 2% milk is 25% of milk sales; then:
  2% milk → wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule.
• A rule is redundant if its support is close to the "expected" value based on the rule's ancestor. Here the expected support is 8% × 25% = 2%, which matches the observed 2%, so the second rule adds no new information (see the sketch below).

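The redundancy test in code (a sketch: the tolerance parameter `tol` is an invented knob; the slide only says "close to" the expected value):

```python
def is_redundant(rule_sup, ancestor_sup, share, tol=0.25):
    """Redundant if support is within tol of ancestor_sup * share."""
    expected = ancestor_sup * share       # e.g. 0.08 * 0.25 = 0.02
    return abs(rule_sup - expected) <= tol * expected

# 2% milk -> wheat bread: support 2%, ancestor support 8%, 25% share
print(is_redundant(0.02, 0.08, 0.25))    # True -> rule is redundant
```
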
Multi-Level Mining: Progressive Deepening
• A top-down, progressive deepening approach:
  • First mine high-level frequent items: milk (15%), bread (10%)
  • Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%)
• Different min_support thresholds across levels lead to different algorithms:
  • If adopting the same min_support across all levels, then toss item t if any of t's ancestors is infrequent.
  • If adopting reduced min_support at lower levels, then examine only those descendants whose ancestor's support is frequent/non-negligible (see the sketch below).

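A sketch of the reduced-support variant; the hierarchy, supports, and thresholds below are illustrative, but the pruning rule is the one just stated (descend only under frequent ancestors):

```python
children = {"milk": ["2% milk", "skim milk"],
            "bread": ["wheat bread", "white bread"]}
supp = {"milk": 0.15, "bread": 0.10, "2% milk": 0.05,
        "skim milk": 0.04, "wheat bread": 0.04, "white bread": 0.03}
min_sup = {1: 0.05, 2: 0.03}            # reduced threshold at level 2

for item in children:
    if supp[item] < min_sup[1]:
        continue                        # infrequent ancestor: skip subtree
    for child in children[item]:
        if supp[child] >= min_sup[2]:
            print(child, supp[child])   # lower-level frequent item
```
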
Association Rules & Correlations
• Basic concepts
• Efficient and scalable frequent itemset mining methods:
  • Apriori, and improvements
  • FP-growth
• Rule derivation, visualization and validation
• Multi-level Associations
• Temporal associations and frequent sequences [later]
• Other association mining methods
• Summary

Other Association Mining Methods
• CHARM: mining frequent itemsets by a vertical data format
• Mining frequent closed patterns
• Mining max-patterns
• Mining quantitative associations (e.g., what is the implication between age and income?)
• Constraint-based association mining
• Frequent patterns in data streams: a very difficult problem; performance is a real issue
• Constraint-based (query-directed) mining
• Mining sequential and structured patterns

Summary
• Association rule mining is probably the most significant contribution from the database community to KDD.
• New interesting research directions:
  • Association analysis in other types of data: spatial data, multimedia data, time series data, etc.
  • Association rule mining for data streams: a very difficult challenge.

Statistical Independence
• Population of 1000 students:
  • 600 students know how to swim (S)
  • 700 students know how to bike (B)
  • 420 students know how to swim and bike (S, B)
• P(S,B) = 420/1000 = 0.42
• P(S) × P(B) = 0.6 × 0.7 = 0.42
• P(S,B) = P(S) × P(B) → statistical independence
• P(S,B) > P(S) × P(B) → positively correlated
• P(S,B) < P(S) × P(B) → negatively correlated
