M.Phil Probation Talk
Association Rules Mining of Existentially Uncertain Data
Presenter: Chui Chun Kit
Supervisor: Dr. Benjamin C.M. Kao
Presentation Outline

Introduction
- What are association rules?
- How to mine association rules from a large database?

Probabilistic Data (Uncertain Data)
- What is probabilistic/uncertain data?
- Possible World interpretation of uncertain data

Mining frequent patterns from uncertain data
- Presents a simple algorithm to mine association rules from uncertain data
- Identifies the computational problem

Efficient methods of mining association rules from uncertain data
Experimental Results and Discussions
Conclusion and Future Work
Section 1
Introduction
What is an association rule?
Introduction
Psychological Symptoms Transaction Database
(columns: Mood Disorder, Anxiety Disorder, Eating Disorder, Obsessive-Compulsive Disorder, Depression, ..., Self Destructive Disorder; rows: Patient 1, Patient 2, ...)

Suppose Peter is a psychologist. He has to judge a list of psychological symptoms to make diagnoses and give treatments to his patients, and all diagnosis records are stored in a transaction database. We call each patient record a transaction, each psychological symptom an attribute with value either yes or no (i.e. a binary attribute), and the collection of patients' records a transaction database.
Introduction

One day, when Peter is reviewing his patients' records, he discovers some patterns in his patients' psychological symptoms, e.g. patients having "mood disorder" often also have "eating disorder". He would like to learn about the associations between different psychological symptoms from his patients.
Introduction

Peter may be interested in the following associations (association rules) among different psychological symptoms:
Mood disorder => Eating disorder
Mood disorder => Depression
Eating disorder => Depression + Mood disorder
Eating disorder + Depression => Self destructive disorder + Mood disorder
These associations are very useful information to assist diagnosis and treatment.
Introduction


However, the psychological symptoms database is very large, so it is impossible to analyze the associations by human inspection. In computer science research, the problem of mining association rules from a transaction database was addressed in 1993 by R. Agrawal with the Apriori algorithm, the basic algorithm for mining association rules.
Introduction
Association Rules
Example: Eating disorder => Depression [support = 2%, confidence = 60%]
- The antecedent is "Eating disorder" and the consequent is "Depression".
- A 2% support value means that 2% of the patients in the database have both psychological symptoms.
- A 60% confidence value means that 60% of the patients having Eating disorder also have Depression.

There are two parameters to measure the interestingness of association rules:
- Support is the fraction of database transactions that contain the items in the association rule; it shows how frequent the items in the rule are.
- Confidence is the percentage of transactions containing the antecedent that also contain the consequent; it shows the certainty of the rule.
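Formally (a standard formulation consistent with these definitions, not copied from the slides), for a rule X => Y over a transaction database D:

```latex
\mathrm{support}(X \Rightarrow Y) = \frac{|\{\,t \in D : X \cup Y \subseteq t\,\}|}{|D|},
\qquad
\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}.
```

In the example above, 2% of all patients have both symptoms, and dividing by the fraction of patients having Eating disorder gives the 60% confidence.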
Introduction
Association Rules
Given the transaction database, find ALL the association rules with SUPPORT values over 10% and CONFIDENCE values over 60%.

Two steps for mining association rules:
- Step 1: Find ALL frequent itemsets. Itemsets are frequent if their supports are over the user-specified SUPPORT threshold.
- Step 2: Generate association rules from the frequent itemsets. An association rule is generated if its confidence is over the user-specified CONFIDENCE threshold.
Introduction
Association Rules
The overall performance of mining association rules is determined by the first step (finding ALL frequent itemsets). For the sake of discussion, this talk focuses on the first step.
Section 1
Introduction
How to mine frequent itemsets from a large database?
Mining Frequent Itemsets
Problem Definition

Given:
- A transaction database D with n attributes (items) and m transactions. Each transaction t is a Boolean vector representing the presence or absence of items in that transaction.
- A minimum support threshold s.

Find ALL itemsets with support values over s.

Transaction Database D
       I1   I2   I3   I4   I5   ...  In
t1     1    0    1    1    1    ...  1
t2     1    1    1    0    0    ...  1
...    ...  ...  ...  ...  ...  ...  ...
tm     0    1    1    0    0    1    1
Brute-force approach
- Suppose there are 5 items in the database, i.e. A, B, C, D and E. There are 32 itemsets in total.
- Scan the database once to count the supports of ALL itemsets together.
- If there are n different items, there will be 2^n itemsets to count in total.
- If there are 20 items, there will be over 1,000,000 itemsets! This is computationally infeasible.
The Apriori Algorithm
Apriori property: all subsets of a frequent itemset must also be frequent.
The Apriori algorithm adopts an iterative approach to identify infrequent itemsets.
(Itemset lattice over items A, B, C, D and E, from the empty set up to ABCDE.) If an itemset such as {A} is found to be infrequent, all of its supersets (AB, AC, ..., ABCDE) are pruned; there is no need to count their frequency.
The Apriori Algorithm
How it works:
- The Apriori algorithm starts by inspecting ALL size-1 items; the supports of ALL size-1 candidates are obtained by scanning the database once.
- After obtaining the supports, ONLY the candidates with support over the support threshold are large (frequent) itemsets.
- If {A} is infrequent, then by the Apriori property ALL supersets of {A} cannot be frequent; this saves the effort of counting the supports of the pruned itemsets.
- The APRIORI-GEN procedure generates ONLY those size-(k+1) candidates which are potentially frequent.
- The algorithm alternates the Subset Function (counting candidate supports) and APRIORI-GEN (generating the next candidates from the large itemsets) iteratively until no candidates are generated.
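To make the level-wise loop concrete, here is a minimal sketch of Apriori over an ordinary (deterministic) transaction database; this is my own illustration, not the authors' implementation, and plain Python sets and dicts stand in for the hash tree:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """transactions: list of sets of items; min_sup: minimum support count."""
    # Size-1 candidates: count supports with one database scan.
    items = {i for t in transactions for i in t}
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    frequent = {s for s, c in counts.items() if c >= min_sup}
    all_frequent, k = set(frequent), 2
    while frequent:
        # APRIORI-GEN: join frequent (k-1)-itemsets and keep only candidates
        # whose (k-1)-subsets are all frequent (Apriori property).
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Subset step: scan the database and count candidate supports.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= min_sup}
        all_frequent |= frequent
        k += 1
    return all_frequent

db = [{'A', 'B', 'C'}, {'A', 'C'}, {'A', 'D'}, {'B', 'C'}]
print(apriori(db, min_sup=2))  # e.g. {A}, {B}, {C}, {A,C}, {B,C}
```

The same loop structure reappears below with the hash-tree subset function and, later, with expected supports for uncertain data.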
Important Detail of Apriori
Subset Function
(Algorithm flow: k = 1 -> Candidate itemsets -> Subset Function -> Large itemsets -> Apriori-Gen -> k = k + 1 -> ...)

Subset Function:
- Scans the database transaction by transaction to increment the corresponding support counts of the candidates.
- Since there are generally many candidates, the Subset Function organizes the candidates in a hash-tree data structure.
- Each interior node of the hash-tree contains a hash table.
- Each leaf node of the hash-tree contains a list of itemsets and support counts.
Important Detail of Apriori
How are candidates stored into the hash tree?

Hash tree data structure:
- Each interior node contains a hash table (here with buckets {1,4,7}, {2,5,8} and {3,6,9}).
- Leaf nodes of the hash-tree contain a list of itemsets and support counts.

Example: to store size-2 candidates such as {1,2}, {2,4}, {3,6}, {1,5}, ..., the hash tree has 2 levels. To insert candidate {1,2}, first hash on the first item (level 0), then hash on the second item (level 1); candidate {1,2} is stored in the resulting leaf node. Similarly, candidate {2,4} is hashed and stored in its slot.
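A minimal sketch of storing size-2 candidates in such a tree (my own illustration; the bucket function (item - 1) % 3 mimics the {1,4,7} / {2,5,8} / {3,6,9} hash table shown on the slide):

```python
def bucket(item):
    return (item - 1) % 3          # 1,4,7 -> 0;  2,5,8 -> 1;  3,6,9 -> 2

def build_hash_tree(candidates):
    """Two-level hash tree for size-2 candidates: interior nodes are hash tables,
    leaves map each candidate itemset to its support count."""
    root = {}
    for a, b in candidates:                      # candidate is a sorted pair (a, b)
        level1 = root.setdefault(bucket(a), {})  # hash on the first item  (level 0)
        leaf = level1.setdefault(bucket(b), {})  # hash on the second item (level 1)
        leaf[(a, b)] = 0                         # leaf stores the itemset and its count
    return root

tree = build_hash_tree([(1, 2), (2, 4), (3, 6), (1, 5)])
# e.g. candidate (1,2) lives in root[0][1]; candidate (2,4) lives in root[1][0]
```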
Important Detail of Apriori
How to process a transaction with the hash-tree?

Fitting a transaction into the hash tree:
- Enumerate ALL size-2 subsets within the transaction; a transaction with 100 items has 100C2 = 4950 size-2 subsets.
- For each subset, e.g. {1,4}, hash on its items and traverse the hash tree to search for the candidate; when the itemset is found in a leaf node, increment its support count.
- The same procedure has to be repeated for ALL size-2 subsets of the transaction, and for ALL transactions.

(Example: transaction 1 contains items 1, 2, 4, ..., 9924; the leaf reached by {1,4} holds candidates {1,4} and {1,7}, and the support count of {1,4} is incremented from 0 to 1.)
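A companion sketch (again my own illustration, reusing the same bucket convention as above) of processing one transaction: enumerate its size-2 subsets and increment the counters of the candidates found in the leaves. Note that a 100-item transaction yields C(100, 2) = 4950 such subsets:

```python
from itertools import combinations
from math import comb

def bucket(item):
    return (item - 1) % 3   # same bucket function as in the storage sketch above

# Leaf reached by hashing (1, x): candidates {1,4} and {1,7} with support counts.
tree = {0: {0: {(1, 4): 0, (1, 7): 0}}}

def process_transaction(tree, transaction, k=2):
    for subset in combinations(sorted(transaction), k):   # all size-k subsets, e.g. (1, 4)
        node = tree
        for item in subset:                                # traverse one level per item
            node = node.get(bucket(item), {})
        if subset in node:
            node[subset] += 1                              # candidate found: increment its count

print(comb(100, 2))                     # a 100-item transaction has 4950 size-2 subsets
process_transaction(tree, {1, 2, 4, 9924})
print(tree)                             # {1,4} incremented to 1, {1,7} stays 0
```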
Section 2
Probabilistic Data
What is probabilistic data?
Probabilistic Database (Uncertain Database)
How to mine association rules from an uncertain database?
Psychological Symptoms Transaction Database
(Each symptom now carries an existential probability per patient, e.g. Patient 1: 97%, 5%, 84%, 14%, 76%, ..., 9%; Patient 2: 90%, 85%, 100%, 86%, 65%, ..., 48%.)

In reality, when psychologists make a diagnosis, they estimate the likelihood of the presence of each psychological symptom of a patient. The likelihood of presence of each symptom is represented in terms of an existential probability.
Probabilistic Database
Other areas that use probabilistic databases:
- Pattern recognition (handwriting recognition, speech recognition, etc.), where each pattern is described by binary features with existential probabilities, e.g. Pattern 1: Feature 1 = 90%, Feature 2 = 85%, ...; Pattern 2: Feature 1 = 60%, Feature 2 = 5%, ...
- Information retrieval
- Scientific databases
Section 2
Probabilistic Data
Possible World interpretation of uncertain databases
(introduced by S. Abiteboul et al. in "On the Representation and Querying of Sets of Possible Worlds", SIGMOD 1987)
Possible World Interpretation

Example: a psychological symptoms database with two symptoms (S1 = Depression, S2 = Eating Disorder) and two patients.

             Depression   Eating Disorder
Patient 1    90%          80%
Patient 2    40%          70%

This uncertain database has 16 possible worlds (each patient either has or does not have each symptom). One of the possibilities is that both patients actually have both psychological illnesses; on the other hand, the uncertain database also captures the possibility that patient 1 has only eating disorder while patient 2 has both illnesses. Each possibility is called a "Possible World". Data uncertainty is eliminated when we focus on an individual possible world, so we can discuss the support of an itemset in each individual world.
Possible World Interpretation

We can discuss the likelihood of each possible world being the true world, and the support of itemset {S1,S2} = {Depression, Eating Disorder} in each world. For example, world 1 (both patients have both symptoms) has likelihood 0.9 × 0.8 × 0.4 × 0.7 = 0.2016 and support 2, while world 2 (patient 1 has only eating disorder, patient 2 has both) has likelihood 0.1 × 0.8 × 0.4 × 0.7 = 0.0224 and support 1.

World   Support of {S1,S2}   World Likelihood
1       2                    0.2016
2       1                    0.0224
3       1                    0.0504
4       1                    0.3024
5       1                    0.0864
6       1                    0.1296
7       1                    0.0056
8       0                    0.0336
…       0                    …

Question: overall, how many {S1,S2} itemsets do we expect across these possible worlds? We define the expected support as the weighted average support count represented by ALL the possible worlds.
Possible World Interpretation

World   Support of {S1,S2}   World Likelihood   Weighted Support
1       2                    0.2016             0.4032
2       1                    0.0224             0.0224
3       1                    0.0504             0.0504
4       1                    0.3024             0.3024
5       1                    0.0864             0.0864
6       1                    0.1296             0.1296
7       1                    0.0056             0.0056
8       0                    0.0336             0
…       0                    …                  0
                             Expected Support:  1

The expected support is calculated by summing up the weighted support counts over ALL the possible worlds; it is the weighted average support count represented by ALL the possible worlds. Here the expected support is 1: we expect 1 patient to have both "Eating Disorder" and "Depression".

Notice that the world likelihoods form a discrete probability distribution over the support values of itemset {S1,S2}. Since the possible worlds are mutually exclusive, the distribution of the support of {S1,S2} is: P(support = 0) = 20.16%, P(support = 1) = 59.68%, P(support = 2) = 20.16%.
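A minimal sketch (my own illustration, not part of the slides) that enumerates the 16 possible worlds of this two-patient example and recovers both the support distribution and the expected support of {Depression, Eating Disorder}:

```python
from itertools import product

# Existential probabilities: rows = patients, columns = (Depression, Eating Disorder)
db = [(0.9, 0.8), (0.4, 0.7)]

dist = {}  # support value -> total likelihood
for world in product([0, 1], repeat=4):  # each bit: symptom present (1) or absent (0)
    bits = [world[0:2], world[2:4]]      # per-patient presence of (S1, S2)
    likelihood = 1.0
    for (p1, p2), (b1, b2) in zip(db, bits):
        likelihood *= (p1 if b1 else 1 - p1) * (p2 if b2 else 1 - p2)
    support = sum(1 for b1, b2 in bits if b1 and b2)  # patients with both symptoms
    dist[support] = dist.get(support, 0.0) + likelihood

expected_support = sum(s * p for s, p in dist.items())
print(dist)               # support 0: 0.2016, support 1: 0.5968, support 2: 0.2016
print(expected_support)   # 1.0
```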
Possible World Interpretation

Instead of enumerating all possible worlds to calculate the expected support, it can be calculated directly: for each transaction, multiply the existential probabilities of the items in the itemset, then take the total sum over all transactions.

Psychological symptoms database
             Depression   Eating Disorder   Weighted Support
Patient 1    90%          80%               0.72
Patient 2    40%          70%               0.28
                          TOTAL SUM         1
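In symbols (my own notation, consistent with the description above), writing P_t(x) for the existential probability of item x in transaction t, the expected support of an itemset X over database D is:

```latex
\mathrm{ExpSup}(X) \;=\; \sum_{t \in D} \prod_{x \in X} P_t(x).
```

For {Depression, Eating Disorder} this gives 0.9 × 0.8 + 0.4 × 0.7 = 0.72 + 0.28 = 1, matching the table.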
Mining Frequent Itemsets from Probabilistic Data

Problem Definition
- Given an uncertain database D, with each item of a transaction associated with an existential probability, and a user-specified support threshold s, return ALL the itemsets having expected support greater than or equal to |D| × s.
- In other words, find ALL the itemsets that are expected to be frequent according to the existential probabilities in the uncertain database.
Section 3
Mining frequent patterns
from uncertain data
The Uncertain Apriori
algorithm
Uncertain Apriori Algorithm

All the procedures are the same as in the conventional association rule mining algorithm (Start -> k = 1 -> Candidate itemsets -> Subset Function -> Size-k large itemsets -> Apriori-Gen -> k = k + 1 -> ... -> End). The only difference is in the subset function.
Uncertain Apriori Algorithm
The only difference is in the subset function: instead of storing support counts, candidate itemsets are associated with an expected support count, and the subset function increments a candidate's count by the expected support contributed by the transaction.

Example: transaction 1 contains item 1 (70%), item 2 (50%), item 4 (30%), ..., item 9924 (30%). For candidate {1,4}, the expected support count is increased by 0.7 × 0.3 = 0.21; candidate {1,7} stays at 0.
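A minimal sketch of the modified subset step (my own illustration; a plain dict stands in for the hash tree, and add_expected_support is a hypothetical helper name):

```python
from itertools import combinations

def add_expected_support(transaction, candidates, k=2):
    """Uncertain Apriori subset step (sketch): 'transaction' maps item -> existential
    probability; 'candidates' maps a size-k itemset (frozenset) -> expected support count."""
    for subset in combinations(sorted(transaction), k):
        key = frozenset(subset)
        if key in candidates:
            contribution = 1.0
            for item in subset:
                contribution *= transaction[item]   # product of the items' existential probabilities
            candidates[key] += contribution         # e.g. {1,4}: 0.7 * 0.3 = 0.21

candidates = {frozenset({1, 4}): 0.0, frozenset({1, 7}): 0.0}
add_expected_support({1: 0.7, 2: 0.5, 4: 0.3, 9924: 0.3}, candidates)
print(candidates)   # {1,4} -> 0.21, {1,7} -> 0.0
```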
Uncertain Apriori Algorithm
Psychological Symptoms Transaction Database
             Mood       Anxiety    Eating     Obsessive-Compulsive   Depression   …   Self Destructive
             Disorder   Disorder   Disorder   Disorder                                Disorder
Patient 1    90%        2%         99%        97%                    92%          …   5%
Patient 2    89%        96%        80%        4%                     8%           …   3%
Patient 3    8%         6%         79%        10%                    5%           …   98%

Thus we can apply Uncertain Apriori to an uncertain database to mine ALL the frequent itemsets. But why does the algorithm execute for so long, and sometimes not even terminate?
Computational Issue
(Same uncertain psychological symptoms database as above.)
Each item (attribute) of a transaction (object) is associated with an existential probability. Besides the items with a very high probability of presence, there are a large number of items with a relatively low probability of presence.
Computational Issue
Consider a transaction with some low existential probability items, e.g. transaction 1 of the psychological symptoms uncertain database: {1 (70%), 2 (50%), 4 (30%), 7 (3%), 10 (2%), ..., 991 (60%)}. The expected support contributed by this transaction to the candidates in one leaf node of the hash tree is:

Candidate Itemset   Expected Support Contribution
{1,4}               0.21
{1,7}               0.021
{1,10}              0.014
{4,7}               0.009
{4,10}              0.006
{7,10}              0.0006

Many of these subset increments are insignificant. If, say, {7,10} turns out to be infrequent after scanning the database, ALL of its subset increments are redundant.
Computational Issue

A preliminary experiment was conducted to verify the computational bottleneck of mining an uncertain database.
- In general, an uncertain database has "longer" transactions (i.e. more items per transaction): some items with high existential probabilities and some items with low existential probabilities.
- In our current study, we focus on datasets with a bimodal existential probability distribution.
Computational Issue

The synthetic datasets simulate a bimodal distribution of existential probability: 7 datasets with the same frequent itemsets, varying the percentage of items with low existential probability.

Dataset:                              1    2       3    4    5       6      7
Items with low existential prob.:     0%   33.33%  50%  60%  66.67%  71.4%  75%
Preliminary Study
(Figures: number of candidate itemsets in each iteration, number of large itemsets in each iteration, and time spent on subset checking in each iteration for the different datasets, e.g. 0% vs. 75% of items with low existential probability.)
- The computational bottleneck occurs in iteration 2: there is a sudden burst in the number of candidates in the second iteration.
- All datasets have the same large (frequent) itemsets, so the subset increments caused by the 75% of items with low existential probability may be largely redundant.
- There is therefore potential to reduce the execution time.
Section 4
Efficient Methods of Mining
Frequent itemsets from
Existentially Uncertain Data
Efficient Method 1
Data Trimming
Avoid insignificant subset
increments
Method 1 - Data Trimming Strategy

Direction: try to avoid incrementing those insignificant expected support counts. This saves the effort of:
- traversing the hash tree,
- computing the expected support (multiplication of floating-point variables), and
- the I/O for retrieving the items with very low existential probability.
Method 1 - Data Trimming Strategy

Question: which items should be trimmed?
- Intuitively, items with low existential probability should be trimmed, but how low is low?
- For the time being, let us assume there is a user-specified trimming threshold.
Method 1 - Data Trimming Strategy

Uncertain database:
       I1    I2    I3    …    I4000
t1     90%   80%   3%    …    1%
t2     80%   4%    85%   …    78%
t3     2%    5%    86%   …    89%
t4     5%    95%   3%    …    100%
t5     94%   95%   85%   …    2%

Trimmed database:
       I1    I2
t1     90%   80%
t2     80%
t4           95%
t5     94%   95%

Statistics:
       Total expected support trimmed   Maximum existential probability of trimmed item
I1     1.1                              5%
I2     1.2                              3%

Create a trimmed database by trimming out all items with existential probability lower than the trimming threshold. During the trimming process, some statistics are kept for error estimation when mining the trimmed database:
- the total trimmed expected support count of each item,
- the maximum existential probability of each trimmed item, and
- other information, e.g. an inverted list, a signature file, etc.
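A minimal sketch of the trimming pass (my own illustration; the dict-of-dicts database representation and the trim_database name are hypothetical) that produces the trimmed database together with the per-item statistics listed above:

```python
def trim_database(uncertain_db, threshold):
    """uncertain_db: list of dicts {item: existential probability}.
    Returns (trimmed_db, stats) where stats[item] = (total trimmed expected support,
    max existential probability among trimmed occurrences of that item)."""
    trimmed_db, stats = [], {}
    for transaction in uncertain_db:
        kept = {}
        for item, prob in transaction.items():
            if prob >= threshold:
                kept[item] = prob                                 # keep significant items
            else:
                total, max_p = stats.get(item, (0.0, 0.0))
                stats[item] = (total + prob, max(max_p, prob))    # bookkeeping for error estimation
        if kept:
            trimmed_db.append(kept)
    return trimmed_db, stats

db = [{'I1': 0.9, 'I2': 0.8, 'I3': 0.03},
      {'I1': 0.8, 'I2': 0.04, 'I3': 0.85}]
trimmed, stats = trim_database(db, threshold=0.1)
print(trimmed, stats)
```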
Method 1 - Data Trimming Strategy
Trimming Process
(Flow: Uncertain Database -> Trimming Module -> Trimmed Database + Statistics; Candidate itemsets -> Subset Function (hash-tree) -> Size-k large itemsets + Size-k infrequent itemsets -> Pruning Module -> Size-k potentially frequent itemsets -> Apriori-Gen, k = k + 1; finally the Patch Up Module.)

- The uncertain database is first passed into the Trimming Module, which removes the items with low existential probability and gathers statistics during the trimming process.
- During the trimming process, the "true" expected support count of every size-1 candidate is counted, i.e. the size-1 large itemsets have no false negatives.
- The size-1 frequent itemsets are then passed into the APRIORI-GEN procedure to generate the size-2 candidates, and the Subset Function scans the trimmed database to count the expected supports of the size-2 candidates.
- Notice that the itemsets found to be infrequent are only infrequent in the trimmed database; they may contain some itemsets that are truly frequent in the original database (missed frequent itemsets).
- The Pruning Module uses the statistics gathered by the Trimming Module to estimate this error and to identify the potentially frequent itemsets among the infrequent ones.
- Here come two strategies: use the potentially frequent itemsets to generate the size-(k+1) candidates, or do not use them.
- Finally, in the Patch Up Module, all the potentially frequent itemsets are checked against the original database to verify their true supports.
- We expect mining the trimmed database to save a lot of I/O and computational cost.
Method 1 - Data Trimming Strategy
Pruning Module
- The role of the Pruning Module is to identify those itemsets which are infrequent in the trimmed database but frequent in the original database.
- The expected support of an itemset {A,B} counted over the trimmed database (i.e. where both A and B are left in the trimmed database) can be obtained by mining the trimmed database; the part of the expected support contributed by the trimmed items has to be estimated.
Method 1 - Data Trimming Strategy
Pruning Module
- If the expected support of {A,B} counted in the trimmed database plus an upper bound of the expected support that was trimmed away (which has to be estimated from the statistics) is greater than or equal to the minimum expected support requirement, {A,B} is regarded as potentially frequent.
- Otherwise, {A,B} cannot be frequent in the original database and can be pruned.
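A hedged sketch of the test (my own notation; the exact bound used on the original slide is not reproduced here): let S_T({A,B}) be the expected support of {A,B} counted over the trimmed database and e({A,B}) an upper-bound estimate, derived from the trimming statistics, of the expected support lost to trimming. Then:

```latex
\widehat{S}_T(\{A,B\}) + \hat{e}(\{A,B\}) \;\ge\; \mathit{min\_sup}
\;\Longrightarrow\; \{A,B\} \text{ is potentially frequent; otherwise } \{A,B\} \text{ is pruned.}
```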
Method 1 - Data Trimming Strategy
Max Count Pruning Strategy
- The pruning strategy depends on the statistics from the Trimming Module. For each size-1 item, it keeps the total expected support count trimmed for that item and the maximum existential probability of the trimmed occurrences.
- Since these statistics are "global" to the whole database, this method is called the Global Max Count Pruning Strategy.

Global statistics:
       Total expected support trimmed   Maximum existential probability of trimmed item
I1     1.5                              5%
I2     1.2                              3%

- Using global counts to estimate over the whole database is sometimes loose, so we may instead keep "local" statistics over parts of the original database (parts a to e) to obtain a tighter bound: the Local Max Count Pruning Strategy.

Local statistics (total expected support trimmed / maximum existential probability of trimmed item):
I1:  part a: 16.6 / 2%,  part b: 14.2 / 0.5%,  part c: 13 / 6%,   part d: 0.1 / 1%,     part e: 2.7 / 0.7%
I2:  part a: 2.7 / 1.1%, part b: 19.5 / 3%,    part c: 2.6 / 7%,  part d: 12.3 / 2.4%,  part e: 0.3 / 0.2%
Method 1 - Data Trimming Strategy
Max Count Pruning Strategy
Let [...], [...] and [...] be the upper bound estimations of [...] and [...] respectively. From iteration 1, we have [...], [...].
Method 1 - Data Trimming Strategy
Patch Up Module
- The Pruning Module identifies a set of potentially frequent itemsets.
- The Patch Up Module verifies the true frequencies of the potentially frequent itemsets against the original uncertain database and recovers the missed frequent itemsets.
- Two strategies: the One Pass Patch Up Strategy and the Multiple Passes Patch Up Strategy.
Method 1 - Data Trimming Strategy
Determine the trimming threshold
Question: which items should be trimmed?
Method 1 - Data Trimming Strategy
Determine the trimming threshold
- Before scanning the database and incrementing the support counts of the candidates, we cannot deduce which itemsets are infrequent.
- We can, however, make a guess at the trimming threshold from the statistics gathered in previous iterations.
Method 1 - Data Trimming Strategy
Determine the trimming threshold
(Figure: cumulative expected support of item A, with its existential probabilities ordered in descending order.)
- Using statistics from the previous iteration, order the existential probabilities of each size-1 item in descending order and plot the cumulative support.
- E.g. item A has its expected support just over the support threshold: it is marginally frequent, and its supersets are potentially infrequent.
- Use the existential probability of the item at which the cumulative support curve crosses the support threshold as the trimming threshold.
- If a superset is infrequent, it will not be frequent in the trimmed database either; we want to trim those items such that the error estimation is tight enough to prune the superset in the Pruning Module.
Method 1 - Data Trimming Strategy
Determine the trimming threshold
(Figure: cumulative expected support of item B, with its existential probabilities ordered in descending order.)
- E.g. item B has its expected support much larger than the support threshold, so its supersets are likely to be frequent.
- Use the existential probability of the item at which the cumulative support reaches the support threshold as the trimming threshold; the expected support contributed by the items after that point is insignificant.
Efficient Method 2
Decremental Pruning
Identify infrequent candidates
during database scan
Method 2 - Decremental Pruning


In some cases, it is possible to identify an itemset as infrequent before scanning the whole database. For instance, suppose the minimum support threshold is 100 and the expected support of item A is 101.

Uncertain Database
         A      other items
t1       70%    0%
t2       50%    0%
…        …      …
t100K    …      …

- The total expected support of A from transaction t2 onwards is 101 - 0.7 = 100.3, and from transaction t3 onwards it is 100.3 - 0.5 = 99.8.
- Since the candidates containing A gained no support from t1 and t2, after scanning transaction t2 every itemset containing A can reach an expected support of at most 99.8 < 100, so ALL candidates containing A must be infrequent and can be pruned.
Method 2 - Decremental Pruning

- Before scanning the database, define two "decremental counters" for itemset {A,B}, one per item, e.g. d_t(A,AB) and d_t(B,AB).
- d_t(A,AB) represents the amount by which the maximum possible support count of itemset {A,B} exceeds the minimum support, under the optimistic assumption that, from transaction t to the end of the database, ALL occurrences of item A match with item B and ALL matching item Bs have 100% existential probability.
Method 2 - Decremental Pruning

While scanning the transactions, update the decremental counters transaction by transaction (a reconstruction of the update rule, inferred from the worked example below, is sketched next).
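A hedged reconstruction of the initialization and update rule, inferred from the worked example on the next slide (min_sup is the minimum expected support count and P_t(.) the existential probability in transaction t; the notation is mine):

```latex
d_0(A, AB) = \mathrm{ExpSup}(A) - \mathit{min\_sup},
\qquad
d_t(A, AB) = d_{t-1}(A, AB) - P_t(A)\,\bigl(1 - P_t(B)\bigr).
```

Once a counter becomes negative, the candidate it bounds can no longer reach the minimum expected support and can be pruned.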
Method 2 - Decremental Pruning
Brute-force method

Example

Uncertain Database
       A      B      C
T1     100%   50%    30%
T2     90%    80%    70%
T3     30%    40%    90%
T4     40%    40%    30%

- Support threshold 50%, so min_sup = 2. Expected supports: A = 2.6, B = 2.1, C = 2.2.
- Before scanning the database, initialize the decremental counters of candidate {A,B}. The value d0(A,AB) = 0.6 means that if ALL the items A match with item B and ALL matching Bs have 100% existential probability over the whole database, the expected support count of {A,B} will be 0.6 larger than min_sup.
- While scanning, update the decremental counters. After T1 the value 0.1 means that if ALL the items A match with item B and ALL matching Bs have 100% existential probability from transaction 2 to 4, the expected support count of {A,B} will be 0.1 larger than min_sup.
- After scanning T2 the counter drops below zero, so we can conclude that candidate {A,B} is infrequent without scanning T3 and T4, which saves computational effort in the subset function.
Method 2 - Decremental Pruning
Brute-force method

This method is infeasible because:
- Each candidate has to be associated with at least 2 decremental counters.
- Even if an itemset has been identified as infrequent, the subset function still has to traverse the hash tree down to the leaf nodes to retrieve the corresponding counters before this is known.

(Hash-tree leaf example: candidate AD with expected support count 0 and decremental counters d0(A,AD), d0(D,AD); candidate AG with counters d0(A,AG), d0(G,AG); and so on.)
Method 2 - Decremental Pruning
Aggregate by item method

- The aggregate-by-item method aggregates the decremental counters and keeps an upper bound of them.
- Brute-force method: with three size-2 candidates there are 6 decremental counters in total.
- Aggregate-by-item method: aggregate the counters d0(A,AB) and d0(A,AC) into a single counter d0(A), an upper bound of the two counters.
Method 2 - Decremental Pruning
Aggregate by item method
Uncertain Database
       A      B      C
T1     100%   50%    30%
T2     90%    80%    70%
T3     30%    40%    90%
T4     40%    40%    30%

- Initialize the aggregated counters before scanning the database.
- Scan transaction T1 and update the decremental counters, e.g. the aggregated counter value becomes 0.6 - [1 × (1 - 0.5)] = 0.1. Since no counter is smaller than zero, we cannot yet conclude that any candidate is infrequent.
- Scan transaction T2 and update the counters again. Since d2(A) is now smaller than zero, {A,B} and {A,C} are infrequent and can be pruned.
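A minimal sketch of aggregate-by-item decremental pruning on this example (my own reconstruction; the aggregated update rule, which subtracts P_t(x) times one minus the best partner probability, is inferred from the slide's numbers):

```python
# Aggregate-by-item decremental pruning on the 4-transaction example (sketch).
db = [  # existential probabilities per transaction for items A, B, C
    {'A': 1.0, 'B': 0.5, 'C': 0.3},
    {'A': 0.9, 'B': 0.8, 'C': 0.7},
    {'A': 0.3, 'B': 0.4, 'C': 0.9},
    {'A': 0.4, 'B': 0.4, 'C': 0.3},
]
min_sup = 2.0
exp_sup = {x: sum(t[x] for t in db) for x in 'ABC'}   # A: 2.6, B: 2.1, C: 2.2

# Aggregated counter d(x) bounds all d(x, xY); initialized to ExpSup(x) - min_sup.
d = {x: exp_sup[x] - min_sup for x in 'ABC'}

for i, t in enumerate(db, start=1):
    for x in 'ABC':
        # Assumed aggregated update: subtract P_t(x) * (1 - max probability of any partner item).
        best_partner = max(t[y] for y in 'ABC' if y != x)
        d[x] -= t[x] * (1 - best_partner)
    pruned = [x for x in 'ABC' if d[x] < 0]
    if pruned:
        print(f"after T{i}: all size-2 candidates containing {pruned} can be pruned")
        break
# Expected outcome on this data: d(A) drops below zero after T2,
# so candidates {A,B} and {A,C} are pruned without scanning T3 and T4.
```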
Method 2 - Decremental Pruning
Hash-tree integration method

- Rather than loosely aggregating the decremental counters by item, the aggregation can be based on the hash function used in the subset function.
- Recall that the brute-force approach stores the decremental counters in the leaf nodes of the hash tree (e.g. a leaf holding AD, AG and DG together with the counters d0(A,AD), d0(D,AD), d0(A,AG), d0(G,AG), d0(D,DG), d0(G,DG)).
Method 2 - Decremental Pruning
Hash-tree integration method
- The hash-tree integration method aggregates the decremental counters according to the hash function and stores the aggregated counters in the hash (interior) nodes of the tree, starting from the root.
- When any of these aggregated decremental counters becomes smaller than or equal to zero, the corresponding itemsets in the leaf nodes below it cannot be frequent and can be pruned.
Method 2 - Decremental Pruning
Hash-tree integration method

Improving the pruning power
- The hash tree is a prefix tree constructed based on the lexicographic order of the items; an item that comes earlier in the order is the prefix of more itemsets.
- For example, among the size-2 candidates {A,B}, {A,C}, {A,D}, {B,C}, {B,D}, {C,D}, there are 3 itemsets under the counter for prefix A: if that counter becomes negative during the database scan, we can prune 3 itemsets. There is only 1 itemset under the counter for prefix C, so if that counter becomes negative we can prune 1 itemset only.
Method 2 - Decremental Pruning
Hash-tree integration method

- Our strategy is to reorder the items by their expected supports in ascending order, so that the counters of the items that prefix more itemsets are the ones more likely to become negative during the database scan, maximizing the number of itemsets pruned.
Efficient Method 3
Candidates filtering
Identify infrequent candidates
before database scan
Method 3 – Candidates filtering

- It is possible to identify some infrequent candidate itemsets before scanning the database to verify their supports.
- During the database scan, keep the maximum existential probability of each item (in addition to its expected support).

Uncertain Database (min_sup = 2)
       A      B      C
T1     30%    50%    100%
T2     70%    80%    90%
T3     90%    40%    30%
T4     30%    40%    40%

Expected support:                  A: 2.2    B: 2.1    C: 2.7
Maximum existential probability:   A: 90%    B: 80%    C: 100%

- After scanning the database, the expected supports and maximum existential probabilities of items A, B and C are obtained as above, and the size-2 candidate itemsets {A,B}, {A,C}, {B,C} are generated.
- From these values we can obtain an upper bound on the expected support of each candidate BEFORE scanning the database. For {A,B}: if ALL items A match with B at B's maximum existential probability, {A,B} will have an expected support of at most 2.2 × 80% = 1.76.

Maximum expected support of the size-2 candidates:   {A,B}: 1.76    {A,C}: 2.2    {B,C}: 2.1

- This upper bound of the expected support of {A,B} is smaller than min_sup, so {A,B} must be infrequent and can be pruned before the database scan.
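A minimal sketch of this filter (my own illustration; the slides show only the ExpSup(A) × maxP(B) direction, and the symmetric min() of both directions is my own tightening):

```python
exp_sup = {'A': 2.2, 'B': 2.1, 'C': 2.7}   # expected supports from the previous scan
max_p   = {'A': 0.9, 'B': 0.8, 'C': 1.0}   # maximum existential probabilities
min_sup = 2.0

candidates = [('A', 'B'), ('A', 'C'), ('B', 'C')]
for x, y in candidates:
    # Upper bound on ExpSup({x,y}): every x pairs with y at y's maximum probability (and vice versa).
    bound = min(exp_sup[x] * max_p[y], exp_sup[y] * max_p[x])
    if bound < min_sup:
        print(f"prune {{{x},{y}}}: upper bound {bound:.2f} < min_sup")  # prunes {A,B} (1.76)
```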
Section 5
Experimental Results
and Discussions
Experiments
Synthetic datasets (e.g. T100R75%I6D100K, HB90HD5LB10LD5)
- Data associations: generated by the IBM Synthetic Generator.
  - Average length of each transaction (T)
  - Average length of the hidden frequent patterns (I)
  - Number of transactions (D)
- Data uncertainty: we would like to simulate the situation where some items have high existential probabilities while others have low existential probabilities, using a bimodal distribution.
  - Base of the high existential probabilities (HB)
  - Base of the low existential probabilities (LB)
  - Standard deviations of the high and low existential probabilities (HD, LD)
  - Percentage of items with low existential probabilities (R)
Experiments
Implementation
- C programming language.
- Machine: 2.6 GHz CPU, 1 GB memory, Fedora.
- Experimental settings: dataset T100R75%I6D100K HB90HD5LB10LD5 (136 MB), support threshold 0.5%.
Experimental Results
Trimming Methods (dataset T100R75%I6D100K, HB90HD5LB10LD5)
(Figure: execution time per iteration of Uncertain Apriori, Trimming - Global Max Count and Trimming - Local Max Count.)
- Iteration 2 is computationally expensive because there are many candidates, leading to heavy computational effort in the subset function.
- The trimming methods successfully reduce the number of subset increments in ALL iterations: for Uncertain Apriori each transaction has 100C2 = 4950 size-2 subsets, whereas for the trimming methods each trimmed transaction has only about 25C2 = 300 size-2 subsets.
- Since we use the one-pass patch up strategy, the trimming methods have one extra Patch Up phase; even including the time spent on this phase, they still show a significant performance gain.
Experimental Results
CPU Cost Saving by Trimming
(Figures: CPU cost per iteration of the trimming methods vs. Uncertain Apriori; CPU cost saving in each iteration; and percentage of CPU cost saving from iteration 2 to 6, for Trimming - Global Max Count and Trimming - Local Max Count.)
- There is a negative CPU saving in iteration 1 because time is spent gathering the statistics for the Pruning Module.
- The trimming methods achieve high computational savings in the iterations where the CPU cost is significant.
Experimental Results
I/O Cost Saving by Trimming (dataset T100R75%I6D100K, HB90HD5LB10LD5)
(Figures: I/O cost per iteration of the trimming methods vs. Uncertain Apriori, and I/O cost saving of the trimming methods in each iteration.)
- The trimming methods have extra I/O effort in iteration 2 because they have to scan the original database PLUS create the trimmed database.
- I/O cost saving occurs from iteration 3 to iteration 6; that is, the I/O cost saving increases when there are longer frequent itemsets.
Experimental Results
Varying the Support Threshold (dataset T100R75%I6D100K, HB90HD5LB10LD5)
(Figure: total execution time of the trimming methods vs. Uncertain Apriori for expected support thresholds from 0.1% to 1%.)
- As the support threshold decreases, the rate of increase in execution time of the trimming methods is smaller than that of Uncertain Apriori.
Experimental Results
Varying the percentage of items with low existential probability (datasets T100R?%I6D100K, HB90HD5LB10LD5)
(Figure: total execution time of the trimming methods vs. Uncertain Apriori for different percentages of items with low existential probability; all datasets have the same frequent itemsets.)
- The trimming methods achieve almost linear execution time as the percentage of items with low existential probability increases.
Experimental Results
Decremental Pruning (dataset T100R75%I6D100K, HB90HD5LB10LD5)
(Figures: percentage of candidates pruned during the database scan in the 2nd iteration, i.e. the pruning power of the decremental pruning methods, plotted against the fraction of the database scanned; and execution time of the decremental pruning methods vs. Uncertain Apriori for different percentages of items with low existential probability.)
- The "Integrate with Hash Tree" method outperforms the "Aggregate by items" method in pruning power.
- Although the "Integrate with Hash Tree" method can prune about twice as many candidates as the "Aggregate by items" method, the time saving is not significant because the "Integrate with Hash Tree" method has more overhead.
Experimental Results
Varying the percentage of items with low existential probability (dataset T100R75%I6D100K, HB90HD5LB10LD5)
(Figure: execution time of the decremental and trimming methods vs. Uncertain Apriori for different percentages of items with low existential probability.)
- The trimming and decremental methods can be combined to form a Hybrid Algorithm.
Experimental Results
Hybrid Algorithms
(Figure: execution time of different combinations vs. Uncertain Apriori for different percentages of items with low existential probability: Decremental (Aggregate by items), Decremental (Integrate with Hash tree), Decremental (Aggregate by items) + Trimming, Decremental (Integrate with Hash tree) + Trimming, and Decremental (Integrate with Hash tree) + Trimming + Candidate Pruning.)
- Combining the 3 proposed methods achieves the smallest execution time.
Experimental Results
Varying the percentage of items with low existential probability (dataset T100R75%I6D100K, HB90HD5LB10LD5)
(Figures: overall CPU saving and overall I/O saving of the Hybrid Algorithm, Decremental (Integrate with Hash tree) + Trimming + Candidate Pruning, for different percentages of items with low existential probability.)
- CPU cost saving occurs when there are 5% or more items with low existential probability in the dataset; 80% or more of the CPU cost is saved for datasets with 40% or more such items.
- I/O cost saving occurs when there are 40% or more items with low existential probability in the dataset.
- In fact, this figure only shows that the I/O cost saving increases as more items are trimmed; the I/O saving should also depend on the length of the hidden frequent itemsets, which could be shown by varying the (I) parameter in the dataset generation process.
Conclusion





- We have defined the problem of mining frequent itemsets from an uncertain database.
- The Possible World interpretation has been adopted as the theoretical background of the mining process.
- Existing frequent itemset mining algorithms are either inapplicable or unacceptably inefficient for mining uncertain data.
- We have identified the computational bottleneck of Uncertain Apriori and proposed a number of efficient methods to reduce both CPU and I/O costs significantly.
Future Works


- Sensitivity and scalability tests on each parameter (T, I, K, HB, LB, etc.).
- Generating association rules from uncertain data: what is the meaning of the association rules mined from uncertain data? We are now interested in rules of the form "80% Eating Disorder => 90% Depression".
- Real case study.
- Other types of association rules: quantitative association rules, multidimensional association rules.
- Papers…
End
Thank you!