Association Rule Mining
Muhammad Ali Yousuf
http://dsc.itmorelia.edu.mx/~maliyusuf/
[email protected]
Contents
Part (A): Maximal frequent itemsets, generation of association rules
Part (B): ARMiner software
Part (C): Horizontal vs. vertical database layout
Part (D): Frequent closed itemsets
Rule:
1. If a set of items (itemset) is not frequent, no superset of this itemset can be frequent.
2. If a set of items (itemset) is frequent, every subset of this itemset will be frequent.
Now we define the concept of a "maximal frequent itemset":
A maximal frequent itemset is a frequent itemset which is not a subset of any other frequent itemset, hence the term maximal.
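As a small, self-contained illustration of this definition, the following Python sketch (the list of frequent itemsets is made up just for the example) keeps only the itemsets that have no frequent proper superset:

# Hypothetical, downward-closed collection of frequent itemsets.
frequent = [{"A"}, {"B"}, {"C"}, {"D"},
            {"A", "B"}, {"A", "C"}, {"B", "C"},
            {"A", "B", "C"}]

# A maximal frequent itemset has no frequent proper superset ('<' is proper subset).
maximal = [x for x in frequent if not any(x < y for y in frequent)]
print(maximal)   # the two maximal itemsets: {"A", "B", "C"} and {"D"}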
Another property of a maximal frequent itemset is that all of its subsets will be frequent; hence, if we can modify our Apriori algorithm to identify the maximal frequent itemsets, that will be a great optimisation.
Here we work through an example of extracting the maximal frequent itemsets.
We have the transactions as follows:

Trans Id   Items
1          A C T W
2          C D W
3          A C T W
4          A C D W
5          A C D T W
6          C D T
min_support = 50% (i.e. a pattern must occur in at least 3 of the 6 transactions to be considered frequent).
Now we generate the frequent itemset lattice in a bottom-up approach. (For simplicity, we have omitted the lines showing how each k-itemset, i.e. an itemset with k members, is formed.)
(Figure: the frequent itemset lattice built bottom-up from these transactions.)
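For readers who want to reproduce the lattice, here is a minimal level-wise sketch in Python over the six transactions above. It is only an illustration, not ARMiner's implementation; in particular, candidates are generated by naive pairwise unions rather than the usual prefix join.

from itertools import combinations

# The six transactions from the table above.
transactions = [
    {"A", "C", "T", "W"},        # 1
    {"C", "D", "W"},             # 2
    {"A", "C", "T", "W"},        # 3
    {"A", "C", "D", "W"},        # 4
    {"A", "C", "D", "T", "W"},   # 5
    {"C", "D", "T"},             # 6
]
min_support = 3   # 50% of 6 transactions

def support(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
frequent = []
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
k = 1
while level:
    print(f"Frequent {k}-itemsets:", [sorted(s) for s in level])
    frequent.extend(level)
    # Candidate (k+1)-itemsets from unions of frequent k-itemsets, then support check.
    candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
    level = [c for c in candidates if support(c) >= min_support]
    k += 1

# Maximal frequent itemsets: those with no frequent proper superset.
maximal = [s for s in frequent if not any(s < t for t in frequent)]
print("Maximal:", [sorted(s) for s in maximal])   # C D W and A C T W

Running this reproduces the pruning described on the following slides: AD and DT are dropped at level 2, CDT at level 3, and only ACTW and CDW remain as maximal frequent itemsets.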
Explanation
Here we start from the empty set and go up each level, retaining the itemsets which are frequent (at least 3/6 occurrences) and throwing away the rest.
As per the Apriori algorithm we will be throwing away the following itemsets, since their support is < 3:
Itemsets AD and DT at level 2, and CDT at level 3.
Rule
At each level, whenever we see that a frequent itemset is a superset of another frequent itemset at a lower level, we rule out that subset (itemset), since by definition the superset includes that subset.
Hence, while generating the association rules, we will consider all the proper subsets of this frequent itemset (which will be the maximal one).
Rule
E.g. at level 3, the itemset "ACT" is formed by the grouping of AC and AT, hence we can rule out the itemsets "AC" and "AT" from the lower level.
If we continue this process, we will be left with only 2 maximal frequent itemsets: ACTW and CDW.
Rule
To summarise what we have done so far:
Step 1: Find frequent itemsets. We have found the maximal frequent itemsets; all possible subsets of these will be frequent.
Step 2: Generate all possible association rules:
(A) Look at all frequent itemsets.
(B) For each frequent itemset 'X', look at all its proper subsets 'Y', such that 'Y' is neither empty nor equal to 'X'.
Observation: For each frequent k-itemset, there will be (2^k) - 2 non-empty proper subsets, hence that many rules to be tested.
To test a rule Y ==> X - Y, we check whether the confidence of the rule is >= min_confidence.
Let's take an example:
Let X = CDW, so k = 3 and the number of proper subsets = 2^3 - 2 = 6.
We now generate all possible association rules with their confidence:
(Table: the six candidate rules Y ==> CDW - Y with their confidences.)
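Since the original table is not reproduced here, the following Python sketch enumerates the non-empty proper subsets Y of X = CDW over the six example transactions and computes conf(Y ==> X - Y); the min_confidence threshold is only an assumed value for illustration, as the slides do not fix one.

from itertools import combinations

# The six transactions from the running example.
transactions = [
    {"A", "C", "T", "W"}, {"C", "D", "W"}, {"A", "C", "T", "W"},
    {"A", "C", "D", "W"}, {"A", "C", "D", "T", "W"}, {"C", "D", "T"},
]

def support(itemset):
    # Support count: number of transactions containing the itemset.
    return sum(1 for t in transactions if set(itemset) <= t)

X = {"C", "D", "W"}     # a maximal frequent itemset, k = 3
min_confidence = 0.8    # assumed threshold, for illustration only

# All non-empty proper subsets Y of X: 2^3 - 2 = 6 of them.
for r in range(1, len(X)):
    for Y in combinations(sorted(X), r):
        conf = support(X) / support(Y)   # conf(Y ==> X - Y) = supp(X) / supp(Y)
        rule = f"{''.join(Y)} ==> {''.join(sorted(X - set(Y)))}"
        verdict = "accepted" if conf >= min_confidence else "rejected"
        print(f"{rule}: confidence = {conf:.2f} ({verdict})")

With the transactions above, this prints confidences of 0.50 for C ==> DW, 0.75 for D ==> CW and CD ==> W, 0.60 for W ==> CD and CW ==> D, and 1.00 for DW ==> C.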
Thus we have seen the need to identify maximal frequent itemsets and how to generate all possible association rules from them.
Introduction to ARMiner
Introduction to Data Storage
We will take a small example of supermarket data. Suppose the items sold by a (very, very small) shop are green apples, red apples, oranges, bananas, and grapes.
Also suppose that this morning you had three customers: one bought green apples and grapes, one bought only oranges, and the last one bought oranges and grapes.
This activity can be represented in the .asc format as follows:
1 green apples
2 red apples
3 oranges
4 bananas
5 grapes
BEGIN_DATA
1 5
3
3 5
END_DATA
There are two distinct parts to this file. The first one contains a listing of all the items you can sell, or, in other words, of all the items that could participate in a transaction. This part looks like this:
1 green apples
2 red apples
3 oranges
4 bananas
5 grapes
The format is pretty simple. Each line must consist of a positive number followed by a string (which can contain blank spaces). It is important that the numbers be assigned in increasing order starting from 1.
Empty lines are allowed to appear in this section. This section enumerates all the entities described by the data, between which ARMiner will later look for association rules.
The second part consists of the actual data:
BEGIN_DATA
1 5
3
3 5
END_DATA
In our case we had 3 transactions, and each is represented on a separate line. The first transaction involved green apples and grapes, and these are represented by the numbers assigned in the first section, that is 1 for green apples and 5 for grapes.
Note that this section must be enclosed between BEGIN_DATA and END_DATA lines. Anything appearing after the END_DATA line will be ignored.
Blank lines are allowed to appear in this section.
Note that although the numbers appearing in each line are sorted, this is not required by the format.
You can list the numbers in any order and the file will still be processed correctly; however, we suggest always listing the numbers in a transaction in increasing order, as this way the processing of the file by asc2db will be more efficient.
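To make the two-part format concrete, here is a small Python reader sketch; it is a hypothetical helper for illustration only (ARMiner's own conversion is done by asc2db), written to follow the rules described above.

def read_asc(path):
    # Returns (item_names, transactions) for an .asc file:
    # item_names maps each positive id to its item string,
    # transactions is a list of sets of ids.
    item_names = {}
    transactions = []
    in_data = False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue                  # blank lines are allowed
            if line == "BEGIN_DATA":
                in_data = True
            elif line == "END_DATA":
                break                     # anything after END_DATA is ignored
            elif in_data:
                transactions.append({int(tok) for tok in line.split()})
            else:
                number, name = line.split(maxsplit=1)
                item_names[int(number)] = name
    return item_names, transactions

# For the supermarket file above this yields
# ({1: 'green apples', ..., 5: 'grapes'}, [{1, 5}, {3}, {3, 5}]).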
Census data

SSN#   Age   Sex   Married   Num_kids   Income
006    26    M     No        0          25000$
345    54    F     Yes       2          55000$
743    37    M     Yes       1          80000$
What Can You Do With It? Let's Look at Each Column:
SSN#: this is unique for each entry, so there is no sense in looking for association rules involving SSN#, at least not in this data, since each SSN# appears only once in the whole data. We can simply ignore this field for mining purposes.
Age: this attribute can take a variety of values. ARMiner cannot handle such attributes easily; in fact, it only considers binary attributes. We need to discretize this attribute, replacing for example ages 0-21 with "very young age", 22-35 with "young age", 35-55 with "middle age", etc.
Sex: this has two values, "male" and "female", so we could create two attributes out of it.
Married: again we can create two attributes, "married" and "not married".
Num_kids: this also has to be discretized, maybe into "no kids", "one kid", and "several kids".
Income: we could also discretize this into "small", "average", and "high".
The discretization should be made so that it clearly identifies the ranges that are of interest to the person who will be mining this data.
With these changes we could represent the above data in .asc format as:
1 very young age
2 young age
3 middle age
4 old age
5 male
6 female
7 married
8 not married
9 no kids
10 one kid
11 several kids
12 small income
13 average income
14 high income
BEGIN_DATA
2 5 8 9 12
3 6 7 11 13
3 5 7 10 14
END_DATA
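The step from the raw census rows to these transaction lines is mechanical; the sketch below shows one way to do the discretization in Python. The income thresholds are an assumption made only for this illustration, since the slides name the three income bins but not their exact boundaries.

# Census rows from the table above: (SSN#, age, sex, married, num_kids, income).
rows = [
    ("006", 26, "M", "No", 0, 25000),
    ("345", 54, "F", "Yes", 2, 55000),
    ("743", 37, "M", "Yes", 1, 80000),
]

def encode(age, sex, married, num_kids, income):
    # Map one census row to the binary attribute ids 1-14 defined above.
    if age <= 21:   age_id = 1           # very young age
    elif age <= 35: age_id = 2           # young age
    elif age <= 55: age_id = 3           # middle age
    else:           age_id = 4           # old age
    sex_id = 5 if sex == "M" else 6
    married_id = 7 if married == "Yes" else 8
    kids_id = 9 if num_kids == 0 else (10 if num_kids == 1 else 11)
    income_id = 12 if income < 40000 else (13 if income < 70000 else 14)  # assumed cut-offs
    return [age_id, sex_id, married_id, kids_id, income_id]

print("BEGIN_DATA")
for ssn, *attrs in rows:                 # SSN# is ignored for mining purposes
    print(" ".join(str(i) for i in encode(*attrs)))
print("END_DATA")

Running it prints exactly the three data lines shown above.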
How To Start ARMiner
For a quick start, first launch the server:
java -jar Server.jar
and then launch the client application:
java -jar Client.jar
You can log in as admin with the password renimra.
Support and Confidence
Support: The fraction of transactions T supporting an itemset X with respect to the database D is called the support of X:
supp(X) = |{T ∈ D | X ⊆ T}| / |D|
The support of a rule X => Y is defined as
supp(X => Y) = supp(X ∪ Y)
Confidence: The confidence of this rule is defined as
conf(X => Y) = supp(X ∪ Y) / supp(X)
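These three definitions translate directly into code; the short Python sketch below evaluates them over the three supermarket transactions from the earlier example, using the rule oranges => grapes purely for illustration.

# The supermarket transactions, written with the item ids 1..5 of the .asc file above.
D = [{1, 5}, {3}, {3, 5}]

def supp(X):
    # supp(X) = |{T in D : X is a subset of T}| / |D|
    return sum(1 for T in D if X <= T) / len(D)

def supp_rule(X, Y):
    # supp(X => Y) = supp(X union Y)
    return supp(X | Y)

def conf(X, Y):
    # conf(X => Y) = supp(X union Y) / supp(X)
    return supp(X | Y) / supp(X)

X, Y = {3}, {5}            # oranges => grapes
print(supp(X))             # 2/3: oranges occur in two of the three transactions
print(supp_rule(X, Y))     # 1/3: oranges and grapes occur together once
print(conf(X, Y))          # 0.5: half of the orange transactions also contain grapes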