Transcript (PPT)

Mining Frequent Itemsets from Uncertain Data
Presented by Chun-Kit Chui, Ben Kao, Edward Hung
Department of Computer Science, The University of Hong Kong
PAKDD 2007
2009-04-10
Summarized by Jaeseok Myung
Reference Slides: i.cs.hku.hk/~ckchui/kit/modules/getfiles.php?file=2007-5-23%20PAKDD2007.ppt
Contents
 Introduction
 Existential Uncertain Dataset
 Calculating the Expected Support
– From an uncertain dataset
 Contribution
 The U-Apriori Algorithm
 Data Trimming Framework
– Dealing with the computational issues of the U-Apriori algorithm
 Experiments
 Conclusion
Existential Uncertain Dataset
 An existential uncertain dataset is a transaction dataset in which each item is associated with an existential probability indicating the probability that the item "exists" in the transaction.

Traditional Transaction Dataset
                Item 1    Item 2    …
Transaction 1   O         O         …
Transaction 2   O         X         …

Existential Uncertain Dataset
                Item 1    Item 2    …
Transaction 1   O (90%)   O (85%)   …
Transaction 2   O (60%)   O (5%)    …
Existential Uncertain Dataset
Psychological Symptoms Dataset
            Mood      Anxiety   Eating    Obsessive-Compulsive   Depression   …   Self-Destructive
            Disorder  Disorder  Disorder  Disorder                                Disorder
Patient 1   97%       5%        84%       14%                    76%          …   9%
Patient 2   90%       85%       100%      86%                    65%          …   48%

 In many applications, the existence of an item in a transaction is best captured by a likelihood measure or a probability
 Symptoms, being subjective observations, would best be represented by probabilities that indicate their presence
 The likelihood of presence of each symptom is represented in terms of existential probabilities
Association Analysis
(The psychological symptoms dataset table from the previous slide is shown again.)
 Psychologists may be interested in the associations between different symptoms
– e.g. Mood Disorder => Eating Disorder + Depression
 Association Analysis on an Uncertain Dataset
– A core step is the extraction of frequent itemsets
– The occurrence frequency of an itemset is often expressed in terms of its support
– However, the definition of support needs to be redefined for uncertain data
Possible World Interpretation
Psychological symptoms dataset
            Depression (S1)   Eating Disorder (S2)
Patient 1   90%               80%
Patient 2   40%               70%

 A dataset with two psychological symptoms and two patients
 16 possible worlds in total: each world assigns presence (√) or absence (×) to every (patient, symptom) pair
 The support counts of itemsets are well defined in each individual world
 From the dataset, one possibility (world 1, all entries √) is that both patients actually have both psychological illnesses
 On the other hand, the uncertain dataset also captures the possibility (world 2) that patient 1 has only the eating disorder while patient 2 has both illnesses

(The slide enumerates all 16 possible worlds as √/× tables.)
Possible World Interpretation
 Support of the itemset {Depression, Eating Disorder} = {S1, S2}
 We can discuss the support count of the itemset {S1, S2} in possible world 1
 We can also discuss the likelihood of possible world 1 being the true world: 0.9 × 0.8 × 0.4 × 0.7 = 0.2016
 The same process can be applied to all possible worlds

World Di   Support {S1,S2}   World Likelihood
1          2                 0.2016 (= 0.9 × 0.8 × 0.4 × 0.7)
2          1                 0.0224
3          1                 0.0504
4          1                 0.3024
5          1                 0.0864
6          1                 0.1296
7          1                 0.0056
8          0                 0.0336
…          0                 …

(The 16 possible-world tables and the dataset are shown again: Patient 1 has Depression 90% and Eating Disorder 80%; Patient 2 has 40% and 70%.)
Expected Support

World Di           Support {S1,S2}   World Likelihood   Weighted Support
1                  2                 0.2016             0.4032
2                  1                 0.0224             0.0224
3                  1                 0.0504             0.0504
4                  1                 0.3024             0.3024
5                  1                 0.0864             0.0864
6                  1                 0.1296             0.1296
7                  1                 0.0056             0.0056
8                  0                 0.0336             0
…                  0                 …                  0
Expected Support                                        1

 Weighted support can be calculated for each possible world (support count × world likelihood)
 Expected support is obtained by summing up the weighted support of all the possible worlds
 Here the expected support is 1: we expect 1 patient to have both illnesses
 To calculate the expected support this way, we need to consider all possible worlds and obtain the weighted support in each enumerated possible world
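As a concrete illustration (ours, not from the slides), here is a brute-force sketch that enumerates every possible world and sums the weighted supports; with two patients and two symptoms it enumerates exactly the 16 worlds above:

```python
from itertools import product

# Existential probabilities: one dict per transaction (patient).
dataset = [
    {"S1": 0.9, "S2": 0.8},  # Patient 1
    {"S1": 0.4, "S2": 0.7},  # Patient 2
]

def expected_support_bruteforce(dataset, itemset):
    """Enumerate every possible world; sum support weighted by world likelihood."""
    entries = [(t, item, p) for t, probs in enumerate(dataset)
               for item, p in probs.items()]
    exp_sup = 0.0
    # Each world decides existence (True/False) for every (transaction, item) pair.
    for world in product([True, False], repeat=len(entries)):
        likelihood = 1.0
        present = [set() for _ in dataset]
        for (t, item, p), exists in zip(entries, world):
            likelihood *= p if exists else (1.0 - p)
            if exists:
                present[t].add(item)
        support = sum(1 for items in present if set(itemset) <= items)
        exp_sup += support * likelihood
    return exp_sup

print(expected_support_bruteforce(dataset, {"S1", "S2"}))  # ≈ 1.0
```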
Simplified Calculation of Expected Support
 Instead of enumerating all "possible worlds" to calculate the expected support, it can be calculated by scanning the uncertain dataset once only:

expSup(X) = Σ_{ti ∈ D} ∏_{xj ∈ X} P_ti(xj)

where P_ti(xj) is the existential probability of item xj in transaction ti.

Psychological symptoms database
            S1    S2    Weighted Support of {S1,S2}
Patient 1   90%   80%   0.72
Patient 2   40%   70%   0.28
Expected Support of {S1,S2}: 1

 The expected support of {S1, S2} is calculated by multiplying the existential probabilities within each transaction and summing these products over all transactions
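A minimal one-scan sketch of this formula (our illustration, assuming the same dict-per-transaction representation as above):

```python
from math import prod

def expected_support(dataset, itemset):
    """Single scan: per transaction, multiply the probabilities of the itemset's
    items (0 if an item is absent), then sum over all transactions."""
    return sum(prod(probs.get(item, 0.0) for item in itemset)
               for probs in dataset)

dataset = [{"S1": 0.9, "S2": 0.8}, {"S1": 0.4, "S2": 0.7}]
print(expected_support(dataset, {"S1", "S2"}))  # 0.72 + 0.28 = 1.0
```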
Problem Definition
 Given an existential uncertain dataset D, with each item of a transaction associated with an existential probability, and a user-specified support threshold s, return ALL the itemsets having expected support greater than or equal to |D| × s.
Contents
 Introduction
 Existential Uncertain Dataset
 Calculating the Expected Support Value
 Contribution
 The U-Apriori Algorithm
 Data Trimming Framework
 Experiments
 Conclusion
The Apriori Algorithm
 The Apriori algorithm starts from size-1 candidate items
 The Subset Function scans the dataset once and obtains the support counts of all size-1 candidates
 If item {A} is infrequent, no itemset including {A} can be a candidate
 The Apriori-Gen procedure generates only those size-(k+1) candidates which are potentially frequent
 The algorithm iteratively prunes and verifies the candidates, until no more candidates are generated

(Slide animation: size-1 candidates {A}, {B}, {C}, {D}, {E}; large itemsets {B}, {C}, {D}, {E}; size-2 candidates {BC}, {BD}, {BE}, {CD}, {CE}, {DE}; pruned itemsets are crossed out.)
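To make the Apriori-Gen step concrete, here is a textbook-style sketch (ours, not the paper's pseudocode) of candidate generation with the join and prune steps:

```python
from itertools import combinations

def apriori_gen(large_k_itemsets):
    """Join: merge pairs of frequent k-itemsets sharing k-1 items.
    Prune: keep a (k+1)-candidate only if all its k-subsets are frequent."""
    large = set(large_k_itemsets)
    candidates = set()
    for a in large:
        for b in large:
            union = a | b
            if len(union) == len(a) + 1:  # a and b share exactly k-1 items
                if all(frozenset(sub) in large
                       for sub in combinations(union, len(a))):
                    candidates.add(union)
    return candidates

large_1 = {frozenset({x}) for x in "BCDE"}  # {A} already pruned as infrequent
print(apriori_gen(large_1))  # the six size-2 candidates {BC}..{DE}, in any order
```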
The U-Apriori Algorithm
 In an uncertain dataset, each item is associated with an existential probability
 The Subset Function reads the dataset transaction by transaction to update the expected support counts of the candidates
– Transaction 1: 1 (90%), 2 (80%), 4 (5%), 5 (60%), 8 (0.2%)
– The expected support of {1, 2} contributed by transaction 1 is 0.9 × 0.8 = 0.72
 Other processes are the same as in the original Apriori algorithm; the authors call this slightly modified algorithm the U-Apriori algorithm
 Inherited from the Apriori algorithm, U-Apriori does not scale well on large datasets
 If an expected support contribution is too small, all the resources spent computing it are wasted

Candidate Itemset   Expected Support
{1,2}               0.72
{1,5}               0.54
{1,8}               0.0018
{4,5}               0.03
{4,8}               0.0001
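A rough sketch of this support-counting step (our illustration; the paper's actual subset-function implementation may differ), reproducing the numbers in the table above:

```python
from math import prod

def update_expected_supports(transaction, candidates, exp_support):
    """For each candidate itemset fully contained in the transaction, add the
    product of its items' existential probabilities to its expected support."""
    for cand in candidates:
        if all(item in transaction for item in cand):
            contribution = prod(transaction[item] for item in cand)
            exp_support[cand] = exp_support.get(cand, 0.0) + contribution

transaction = {1: 0.9, 2: 0.8, 4: 0.05, 5: 0.6, 8: 0.002}
candidates = [frozenset({1, 2}), frozenset({1, 5}), frozenset({1, 8}),
              frozenset({4, 5}), frozenset({4, 8})]
exp_support = {}
update_expected_supports(transaction, candidates, exp_support)
# {1,2}: 0.72, {1,5}: 0.54, {1,8}: 0.0018, {4,5}: 0.03, {4,8}: 0.0001
```

Note how the contributions of {1,8}, {4,5}, and {4,8} are tiny; this is exactly the wasted work the trimming framework targets.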
Computational Issue
 CPU cost in each iteration on datasets with different characteristics
 7 synthetic datasets with the same frequent itemsets; the percentage of items with low existential probability (R) varies across the datasets
– Dataset 1: R = 0%, 2: 33.33%, 3: 50%, 4: 60%, 5: 66.67%, 6: 71.4%, 7: 75%
 Although all datasets contain the same frequent itemsets, the U-Apriori algorithm requires different amounts of time to execute
 With a high fraction of low-probability items (e.g. R = 75%), we can potentially avoid many insignificant support calculations

(Chart: CPU cost per iteration for the 7 datasets.)
Data Trimming Framework

Uncertain dataset             Trimmed dataset
     I1     I2                     I1     I2
t1   90%    80%               t1   90%    80%
t2   80%    4%                t2   80%
t3   2%     5%                t4          95%
t4   5%     95%               t5   94%    95%
t5   94%    95%

Statistics
     Total expected support     Maximum existential
     count being trimmed        probability being trimmed
I1   1.1                        5%
I2   1.2                        3%

 In order to deal with the computational issue, we can create a trimmed dataset by trimming out all items with low existential probabilities
 During the trimming process, some statistics are kept for error estimation when mining the trimmed dataset:
– Total expected support count being trimmed of each item
– Maximum existential probability being trimmed of each item
– Other information, e.g. inverted lists, signature files, etc.
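A simplified sketch of the trimming pass (our reading of the framework; the single global `threshold` parameter and the statistics layout are assumptions, and the paper also allows per-item local thresholds):

```python
def trim_dataset(dataset, threshold):
    """Remove low-probability items; record per-item statistics for error
    estimation: (total expected support trimmed, max probability trimmed)."""
    trimmed = []
    stats = {}
    for transaction in dataset:
        kept = {}
        for item, p in transaction.items():
            if p >= threshold:
                kept[item] = p
            else:
                total, pmax = stats.get(item, (0.0, 0.0))
                stats[item] = (total + p, max(pmax, p))
        trimmed.append(kept)
    return trimmed, stats

dataset = [{"I1": 0.90, "I2": 0.80}, {"I1": 0.80, "I2": 0.04},
           {"I1": 0.02, "I2": 0.05}, {"I1": 0.05, "I2": 0.95},
           {"I1": 0.94, "I2": 0.95}]
trimmed, stats = trim_dataset(dataset, threshold=0.1)
```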
Data Trimming Framework (k-th iteration)
 The uncertain dataset is first passed into the trimming module to remove the items with low existential probability and to gather statistics during the trimming process
 The trimmed dataset is then mined by the U-Apriori algorithm
 The pruning module uses the statistics gathered by the trimming module to find out whether the k-itemsets that are infrequent in the trimmed dataset could still be frequent in the original dataset
 The potentially frequent k-itemsets are passed back to the U-Apriori algorithm to generate candidates for the next iteration
 Pruning an itemset that is infrequent only in the trimmed dataset can be a mistake, so the potentially frequent itemsets are verified by the patch-up module against the original dataset, yielding the frequent itemsets in the original dataset
Data Trimming Framework
 There are three modules under the data trimming framework; each module can have a different strategy
 What statistics are used in the pruning strategy?
– Total expected support count being trimmed of each item
– Maximum existential probability being trimmed of each item
 Is the trimming threshold global to all items or local to each item?
– Local threshold
 Can we use a single scan to verify all the potentially frequent itemsets, or are multiple scans over the original dataset needed?
– Single scan

(The module diagram from the previous slide is repeated.)
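The slides do not reproduce the paper's exact pruning conditions, but the first statistic already yields one simple (looser than the paper's) upper bound we can sketch: every trimmed occurrence of an item x contributes at most its trimmed probability to any itemset containing x, so an itemset's expected support in the original dataset is at most its trimmed expected support plus the total expected support trimmed for each of its items.

```python
def upper_bound_original_support(itemset, trimmed_support, stats):
    """Crude upper bound on an itemset's expected support in the original
    dataset: trimmed support plus, per item, the total expected support
    removed for that item (stats as built by trim_dataset above)."""
    error = sum(stats.get(item, (0.0, 0.0))[0] for item in itemset)
    return trimmed_support + error

# If this bound is below |D| * s, the itemset is certainly infrequent in the
# original dataset and can be pruned without going to the patch-up module.
```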
Experimental Setup
 Step 1: Generate data without uncertainty
– IBM Synthetic Datasets Generator
– Average length of each transaction (T = 20)
– Average length of frequent patterns (I = 6)
– Number of transactions (D = 100K)
 Step 2: Introduce existential uncertainty to each item in the generated dataset (Data Uncertainty Simulator)
– Assign relatively high probabilities to the items in the generated dataset: normal distribution (mean = 95%, standard deviation = 5%)
– Assign additional items with relatively low probabilities to each transaction: normal distribution (mean = 10%, standard deviation = 5%)
– The proportion of items with low probabilities is controlled by the parameter R (R = 75%)

TID   Items (generated)   TID   Items (with uncertainty)
1     2, 4, 9             1     2(90%), 4(80%), 9(30%), 10(4%), 19(25%)
2     5, 4, 10            2     5(75%), 4(68%), 10(100%), 14(15%), 19(23%)
3     1, 6, 7             3     1(88%), 6(95%), 7(98%), 13(2%), 18(7%), 22(10%), 25(6%)
…     …                   …     …
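A sketch of this two-step uncertainty simulator (ours: the function name, the clamping of samples into (0, 1], and the formula used to hit the target low-probability fraction are all assumptions, not the generator's actual code):

```python
import random

def add_uncertainty(transactions, all_items, r=0.75,
                    hi=(0.95, 0.05), lo=(0.10, 0.05)):
    """Give existing items high probabilities drawn from N(hi); pad each
    transaction with extra low-probability items drawn from N(lo) so that
    roughly a fraction r of its items are low-probability ones."""
    def clamp(p):
        return min(max(p, 0.001), 1.0)  # keep sampled probabilities in (0, 1]
    uncertain = []
    for items in transactions:
        t = {i: clamp(random.gauss(*hi)) for i in items}
        n_low = int(len(items) * r / (1 - r))  # one way to reach fraction r
        pool = [x for x in all_items if x not in t]
        for i in random.sample(pool, min(n_low, len(pool))):
            t[i] = clamp(random.gauss(*lo))
        uncertain.append(t)
    return uncertain

transactions = [{2, 4, 9}, {5, 4, 10}, {1, 6, 7}]
print(add_uncertainty(transactions, all_items=range(1, 30)))
```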
CPU Cost with Different R
 When R increases, more items with low existential probabilities are contained in the dataset, so there are more insignificant support increments
 Since the trimming method avoids those insignificant support increments, its CPU cost is much smaller than that of the U-Apriori algorithm
 The trimming approach achieves positive CPU cost savings when R is over 3%
 When R is too low, fewer low-probability items can be trimmed, and the savings cannot compensate for the extra computational cost of the patch-up module
Conclusion
 This paper discussed the problem of mining frequent itemsets from existentially uncertain data
 It introduced the U-Apriori algorithm, a modified version of the Apriori algorithm, to work on such datasets
 It identified the computational problem of U-Apriori and proposed a data trimming framework to address this issue
– The data trimming method works well on datasets with a high percentage of low-probability items
Paper Evaluation
 Pros
– Well-defined problem
– Good presentation (well-organized paper)
– Flexible trimming framework
 My Comments
– Good research field with many opportunities
– U-Apriori algorithm in 2007