Identifying Interesting Association Rules with

Download Report

Transcript Identifying Interesting Association Rules with

Identifying Interesting Association
Rules with Genetic Algorithms
Elnaz Delpisheh
York University
Department of Computer Science and Engineering
April 6, 2016
Data mining
Too much
data
•I = {i1,i2,...,in} is a set of items.
•D = {t1,t2,...,tn} is a transactional database.
•ti is a nonempty subset of I.
•An association rule is of the form AB, where
A and B are the itemsets, A⊂ I, B⊂ I, and
A∩B=∅ .
•Apriori algorithm is mostly used for association
rule mining.
•{milk, eggs}{bread}.
2
Data
Data Mining
Association
rules
Apriori Algorithm
TID
3
List of item IDs
T100
I1,I2,I3
T200
I2, I4
T300
I2, I3
T400
I1,I2,I4
T500
I1, I3
T600
I2, I3
T700
I1, I3
T800
I1, I2, I3, I5
T900
I1, I2, I3
Apriori Algorithm (Cont.)
4
Association rule mining
Too much
data
Data
Data Mining
Too many
association
rules
Association
rules
5
Interestingness criteria
 Comprehensibility.
 Conciseness.
 Diversity.
 Generality.
 Novelty.
 Utility.
 ...
6
Interestingness measures
 Subjective measures
 Data and the user’s prior knowledge are considered.
 Comprehensibility, novelty, surprisingness, utility.
 Objective measures
 The structure of an association rule is considered.
 Conciseness, diversity, generality, peculiarity.
 Example: Support
 It represents the generality of a rule.
 It counts the number of transactions containing both A and B.
7
Drawbacks of objective measures
 Detabase-dependence
 Lack of knowledge about the database
 Threshold dependence
 Solution
 Multiple database reanalysis
 Problem
o Large number of disk I/O
Detabase-independence
8
Genetic algorithm-based learning
(ARMGA )
Initialize population
2. Evaluate individuals in population
3. Repeat until a stopping criteria is met
1.
Select individuals from the current population
B. Recombine them to obtain more individuals
C. Evaluate new individuals
D. Replace some or all the individuals of the current population
by off-springs
A.
4.
9
Return the best individual seen so far
ARMGA Modeling
 Given an association rule XY
 Requirement
 Conf(XY) > Supp(Y)
 Aim is to maximise
10
ARMGA Encoding
 Michigan Strategy
 Given an association k-rule XY, where X,Y⊂I, I is
a set of items I=i1,i2,..., in, and X∩Y=.
 For example
 {A1,...,Aj}{Aj+1,...,Ak}
11
ARMGA Encoding (Cont.)
 The aforementioned encoding highly depends on the length
of the chromosome.
 We use another type of encoding:
 Given a set of items {A,B,C,D,E,F}
 Association rule ACFB is encoded as follows
 00A11B00C01D11E00F
 00: Item is antecedent
 11: Item is consequence
 01/10: Item is absent
12
ARMGA Operators
 Select
 Crossover
 Mutation
13
ARMGA Operators-Select
 Select(c,ps): Acts as a filter of the chromosome
 C: Chromosome
 Ps: pre-specified probability
14
ARMGA Operators-Crossover
 This operation uses a two-point strategy
15
ARMGA Operators-Mutate
16
ARMGA Initialization
17
ARMGA Algorithm
18
Empirical studies and Evaluation
 Implement the entire procedure using Visual C++
 Use WEKA to produce interesting association rules
 Compare the results
19
20