Optimization of Association Rules Extraction Through Exploitation of
Download
Report
Transcript Optimization of Association Rules Extraction Through Exploitation of
Optimization of Association Rules
Extraction Through Exploitation
of Context Dependent Constraints
Arianna Gallo, Roberto Esposito, Rosa Meo, Marco Botta
Dipartimento di Informatica, Università di Torino
Outline
Motivations
Knowledge Discovery from Database (KDD),
Inductive Databases
Incremental Constraint Evaluation
Association Rule Mining
Incremental Algorithms
Constraints properties
Item Dependent Constraints (IDC)
Context Dependent Constraints (CDC)
Incremental Algorithms for IDC and CDC
Performance results and Conclusions
Constraint-Based Mining
Motivations: KDD process and
Inductive Databases (IDB)
KDD process consists of a non trivial extraction of implicit,
previously unknown, and potentially useful information from
data
Inductive Databases have been proposed by Mannila and
Imielinski [CACM’96] as a support for KDD
KDD
is an interactive and iterative process
Inductive
Databases contain both data and inductive
generalizations (e.g. patterns, models) extracted from
the data.
can query the inductive database with an
advanced, ad-hoc data mining query language
users
constrained-based queries
Motivations: Constraint-Based
Mining and Incrementality
Why constraints?
can be pushed in the pattern computation and pruning
the search space;
provide to the user a tool to express her interests (both
in data and in knowledge).
In IDB constraint-based queries are very often a refinement
of previous ones
Explorative process
Reconciling backgroung and extracted knowledge
Why executing each query always from scratch?
The new query can be executed incrementally!
[Baralis et al., DaWak’99]
A Generic Mining Language
R=Q( T, G , I , (M) , )
A very generic constraint-based mining query requests:
extraction from a source table
of set of items (itemsets) (on some schema I)
from the groups of the database (grouping constraints)
satisfying some user defined constraints (mining constraints)
The number of such groups must be sufficient
(user defined statistical evaluation measures, such as support)
In our case R contains association rules
An Example
R=Q(purchase,customer,product,price>100,support_count>=2)
purchase
transaction
customer
product
date
price
quantity
1
1001
hiking boots
12/7/98
140
1
1
1001
ski pants
12/7/98
180
1
3
1001
jacket
17/7/98
300
1
2
2256
col shirt
12/7/98
25
2
2
2256
ski pants
13/7/98
180
1
2
2256
jacket
13/7/98
300
1
4
3441
col shirt
13/7/98
25
3
5
3441
jacket
20/8/98
300
2
Mining Query
R
body
itemset head
{jacket}
ski pants
{ski pants}
ski pants
jacket
{jacket, ski pants}
jacket
support_count
frequency
confidence
3
2/3
2/3
2
2/3
1
2
Incremental Algorithms
We studied an incremental
approach to answer new constraintbased queries which makes use of
the information (rules with support
and confidence) contained in
previous results
We individuated two classes of
query constraints:
item dependent (IDC)
context dependent (CDC)
We propose two newly developed incremental
algorithms which allow the exploitation of past results
in the two cases (IDC and CDC)
Relationships between two queries
We can speed up the execution time of a new query
using results of previous queries. Which previous
queries?
Query equivalence: R1=R2
no computation is needed [FQAS’04]
Query containment: [This paper]
Inclusion: R2 R1 and common elements have
the same statistical measures. R2=C(R1)
Dominance: R2 R1 but common elements do
not have the same statistical measures. R2C(R1)
How can we recongnize inclusion or dominance between
two constraints-based queries?
IDC vs CDC
Item Dependent Constraints (IDC )
Context Dependent Constraints (CDC )
are functionally dependent on the item
extracted
depend on the transactions in the
database
are satisfied for a given itemset either for
all the groups in the database or for none
might be satisfied for a given itemset only
for some groups in the database
if an itemset is common to R1 and R2, it
will have the same support: inclusion
a common itemset to R1 and R2 might
not have the same support: dominance
IDC: price > 150
CDC: qty > 1
transaction
customer
product
date
price
quantity
1
1001
ski pants
12/7/98
140
1
1
1001
hiking boots
12/7/98
180
1
2
1001
jacket
17/7/98
300
22
2
2256
col shirt
12/7/98
25
2
2
2256
ski pants
13/7/98
140
22
3
2256
jacket
13/7/98
300
1
4
2256
col shirt
13/7/98
25
3
4
2256
jacket
20/8/98
300
22
Incremental Algorithm for
IDC
Previous query
Current query
Q1
…..
…..
Constraint: price > 5
Constraint: price >10
…..
Rules in memory
Q2
…..
Item Domain Table
BODY
R1
A
A
…
HEAD SUPP CONF
B
C
…
2 1
… …
… …
delete from R1 all rules containing item C
BODY
R2
A
…
HEAD SUPP CONF
B
…
2 1
… …
(R2=P(R1))
Fail
item category
A hi-tech
B hi-tech
C housing
price
12
14
8
item C belongs to a row that
does not satisfy the new IDC
constraint
Incremental Algorithm for
CDC
Previous query
Q1
Q2
…..
…..
Constraint: qty > 5
…..
Rules in
memory
BODY
R1
…
HEAD
SUPP
CONF
…
…
…
…
read the DB
find groups
-in which new
constraints are satisfied
-containing items
belonging to BHF
update support counters in BHF
…
R2
Constraint: qty >10
…..
build BHF
BODY
Current query
HEAD
SUPP
…
…
CONF
…
Body-Head Forest (BHF)
a (4)
f
gmg (3)
rule:
body (head) tree contains
itemsets which are candidates
for being in the body (head)
part of the rule
an itemset is represented as a
single path in the tree and vice
versa
each path in the body (head)
a fg
tree is associated to a counter
rule support = 3 representing the body (rule)
confidence = 3/4 support
Experiments (1): IC vs CD algorithm
execution time vs
constraint selectivity
execution time vs
volume of previous result
ID algorithm
(a)
(b)
CD algorithm
(c)
(d)
Experiments(2): CARE vs Incremental
(a)
(b)
execution time vs cardinality of previous result
execution time vs support threshold
(c)
execution time vs selectivity of constraints
Conclusions and future works
We proposed two incremental algorithms to constraint-based
mining which make use of the information contained in
previous result to answer new queries.
The first algorithm deals with item dependent constraints,
while the second one with context dependent ones.
We evaluated the incremental algorithms on a pretty large
dataset. The result set shows that the approach reduces
drastically the execution time.
An interesting direction for future research:
integration of condensed representations with these
incremental techniques
the end
questions??
condensed representation
It is well known that the set of association rules can rapidly
grows to be unwieldy, especially as the frequency bound
decreases.
Since most of these rules turn out to be redundant, it is not
necessary to mine rules from all frequent itemsets, but it is
sufficient to consider only the rules among closed frequent
itemsets
In fact, frequent closed itemsets are a small subsets (or
condensed representation) of frequent itemsets without
information loss
For these reason, mining the frequent closed itemsets instead of
frequent itemsets takes great advantages.