Transcript of presentation slides

An Efficient Rigorous Approach for
Identifying Statistically Significant Frequent Itemsets
Adam Kirsch, Michael Mitzenmacher, Harvard University
Andrea Pietracaprina, Geppino Pucci, Fabio Vandin, University of Padova
Eli Upfal, Brown University
PODS 2009
Presented by Dongjoo Lee, IDS Lab., CSE, SNU
Frequent Pattern Mining
Given a dataset D of transactions over a set of items I, and a support threshold s, return all
itemsets X such that X ⊆ I with support at least s in D (i.e., contained in at least s transactions).
[Diagram: transaction datasets D, together with a support threshold, are fed to a mining algorithm (e.g., Apriori, FP-Growth, …) that outputs the frequent itemsets, which in turn feed applications such as association rules, correlations, sequences, episodes, classifiers, clusters, …]
Among all possible nCk itemsets of size k (k-itemsets), we are interested in statistically significant ones,
that is, itemsets whose supports are significantly higher, in a statistical sense, than their expected supports in
a dataset where individual items are placed independently in transactions.
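To make the notion of expected support concrete: under the independence model, the expected support of an itemset is simply t times the product of its items' frequencies. A minimal illustration (not code from the paper; the function name is ours):

```python
import math

def expected_support(item_freqs, t):
    """Expected support of an itemset in a random dataset of t transactions,
    where item i enters each transaction independently with frequency item_freqs[i]."""
    return t * math.prod(item_freqs)

# A pair of items that each occur in 1/1000 of 1,000,000 transactions
# is expected to co-occur in about 1 transaction.
```

An itemset is statistically interesting when its observed support in D is much larger than this expectation.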
IDS Lab. 2009 Spring Seminar
Copyright 2009 by CEBT
2
Measuring the Statistical Significance of a Discovery: D vs. D̂
Model
 D: dataset of t transactions on a set I of n items, where each transaction di ⊆ I
 n(i): number of transactions that contain item i
 fi = n(i)/t: frequency of item i
 D̂: random dataset where item i is included in any given transaction with probability fi, independently of all other items and all other transactions
Null hypothesis H0
 The support s(X, D) in D is drawn from the same distribution as its support s(X, D̂) in D̂.
Alternative hypothesis H1
 The support s(X, D) in D is not drawn from that distribution; in particular, there is a positive correlation between the occurrences of the individual items in X.
Measuring the Statistical Significance of a Discovery: D vs. D̂
Example
 t = 1,000,000 transactions; |I| = 1,000 items
 fi = fj = 1/1,000
 Ti,j = { t | t ∈ D, i ∈ t, j ∈ t }
 Qn = { {i, j} | |Ti,j| = n }
 Pr(i and j co-occur in a transaction) = fi × fj = 0.000001
 E[|Ti,j|] = 0.000001 × 1,000,000 = 1
 Binomial distribution: Pr(k) = nCk p^k (1−p)^(n−k), so Pr(|Ti,j| = 7) ≈ 0.0001
 Number of candidate pairs: 1000C2 = 499,500
 E[|Q7|] ≈ 0.0001 × 499,500 ≈ 50
 Observed: |Q7| = 300, i.e., 300 pairs {i, j} with support({i, j}) = 7
 Pr(|Q7| = 300) ≤ 2^−300
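The slide's arithmetic can be checked directly; an illustrative verification (note that rounding Pr(|Ti,j| = 7) up to 10^-4 is what yields the slide's E[|Q7|] ≈ 50):

```python
import math

t, n_items = 1_000_000, 1_000
p = (1 / n_items) ** 2                   # Pr(i and j co-occur in one transaction) = 1e-6

# Pr(|Ti,j| = 7) under Binomial(t, p); since lambda = t * p = 1,
# the Poisson value e^-1 / 7! is an excellent approximation.
pmf_exact = math.comb(t, 7) * p**7 * (1 - p) ** (t - 7)
pmf_poisson = math.exp(-1) / math.factorial(7)   # about 7.3e-5

n_pairs = math.comb(n_items, 2)          # 1000C2 = 499,500 pairs {i, j}
e_q7 = n_pairs * pmf_exact               # about 36; rounding the pmf to 1e-4 gives the slide's ~50
```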
Statistical Hypothesis Testing
p-value: the probability (Pr) of the observation, assuming the null hypothesis is true
 e.g., observing |Q7| = 300 in D: Pr(|Q7| = 300 | H0) ≤ 2^−300
If the observation falls in the critical region C (e.g., p-value ≤ 0.05), reject the null hypothesis.
 Significance level of the test: α = Pr(Type I error)
 probability of rejecting H0 when it is true (false positive)
 Power of the test: 1 − β, where β = Pr(Type II error)
 probability of correctly rejecting the null hypothesis when it is false
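The test on this slide can be sketched as a one-sided binomial test; a minimal illustration using the Poisson tail for the binomial null (function names are ours, not the paper's):

```python
import math

def p_value(observed_support, t, p_null):
    """One-sided p-value Pr(support >= observed | H0), where under H0 the
    support is Binomial(t, p_null); computed via the Poisson approximation."""
    lam = t * p_null
    cdf = sum(math.exp(-lam) * lam**k / math.factorial(k)
              for k in range(observed_support))
    return 1.0 - cdf

def reject_null(observed_support, t, p_null, alpha=0.05):
    """Reject H0 when the p-value falls in the critical region."""
    return p_value(observed_support, t, p_null) <= alpha
```

For the running example, p_value(7, 1_000_000, 1e-6) is about 8e-5, well below α = 0.05, so H0 is rejected.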
Multi-hypothesis Testing
[Diagram: all nCk k-itemsets Xi are tested simultaneously against the random dataset D̂; each observation |QXi,k| = s yields a p-value pXi = Pr(|QXi,k| = s | H0i). Which of these discoveries fall in the critical region C (p-value ≤ 0.05)?]

 The outcome of an experiment is used to test a number of hypotheses simultaneously.
 Significance level of multi-hypothesis testing
 Family Wise Error Rate (FWER)
 – probability of incurring at least one Type I error in any of the individual tests
 – conservative: for large numbers of hypotheses, FWER-controlling techniques lead to tests with low power
 False Discovery Rate (FDR)
 – less conservative
False Discovery Rate Control
 False Discovery Rate (FDR)
 expected ratio of erroneous rejections among all rejections
 FDR = E[V/R] (with V/R = 0 when R = 0)
 – V: number of Type I errors in the individual tests
 – R: total number of null hypotheses rejected by the multiple test
 FDR Control
 controls the expected proportion of incorrectly rejected null hypotheses
Standard FDR Control
Consider all possible k-itemsets.
Compute the p-value of each itemset X with support s from the binomial distribution.
Report the itemsets that satisfy the FDR constraint.
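The slide does not name the FDR-controlling procedure; the standard choice is the Benjamini-Hochberg step-up rule, sketched here under that assumption (binomial p-values via a Poisson tail, as in the earlier example; function names are ours):

```python
import math

def binomial_p_value(s, t, p):
    """Pr(support >= s) for support ~ Binomial(t, p), via the Poisson tail
    (accurate when t is large and p small)."""
    lam = t * p
    return 1.0 - sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(s))

def benjamini_hochberg(p_values, beta):
    """Indices of the hypotheses rejected at FDR level beta (step-up rule)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * beta / m:
            k_max = rank              # largest rank passing its threshold
    return sorted(order[:k_max])
```

Each candidate k-itemset contributes one p-value; the rejected indices are reported as significant.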
IDS Lab. 2009 Spring Seminar
Copyright 2009 by CEBT
8
What Do the Authors Do?
1. Approximate the distribution of Q̂k,s for supports above a minimum support smin:
 Poisson approximation via the Chen-Stein method.
2. Find the minimum support smin for which the distribution of Q̂k,s is approximated within error ϵ:
 a Monte Carlo method.
3. Establish a support threshold s* with a controlled FDR:
 reduces the number of false discoveries compared to the standard multi-comparison test.
Poisson Distribution
If the expected number of occurrences in a certain interval is λ, then the
probability that there are exactly k occurrences (k being a non-negative
integer, k = 0, 1, 2,…) is equal to
f(k; λ) = λ^k e^(−λ) / k!
[Figures: plots of the probability mass function and the cumulative distribution function]
Poisson Approximation for Q̂k,s
 Let Qk,s be the number of itemsets of size k with support at least s with respect to D; Q̂k,s is the corresponding random variable for D̂.
 Fix k and s.
 Define a collection of Bernoulli random variables { ZX | X ⊂ I, |X| = k }, such that ZX = 1 if the itemset X appears in at least s transactions of the random dataset D̂, and ZX = 0 otherwise.
 pX = Pr(ZX = 1)
 Let I(X) = { X′ | X ∩ X′ ≠ ø, |X′| = |X| }
 If Y ∉ I(X), then ZY and ZX are independent.
Poisson Approximation for Q̂k,s
THEOREM 1. Let U be a Poisson random variable such that E[U] = E[Q̂k,s] = λ < ∞. The variation distance between the distributions L(Q̂k,s) of Q̂k,s and L(U) of U satisfies
 ||L(Q̂k,s) − L(U)|| ≤ 2(b1 + b2),
where, per the Chen-Stein method, b1 = ΣX ΣY∈I(X) pX pY and b2 = ΣX ΣY∈I(X),Y≠X E[ZX ZY].
b1 and b2 are both decreasing in s. Therefore, if b1 + b2 < ϵ for a given s, then the same upper bound will hold for every s′ > s.
…
A Monte Carlo Method for Determining smin
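The algorithm on this slide survives only as a figure in the transcript. Below is a minimal sketch of the sampling primitive such a Monte Carlo method rests on, under the null model defined earlier (item i joins each transaction independently with probability fi); the function names and the direct-enumeration counting are our illustration, not the paper's exact procedure:

```python
import random
from itertools import combinations

def sample_dataset(freqs, t, rng):
    """Draw one random dataset D-hat: item i joins each of the t transactions
    independently with probability freqs[i]."""
    return [{i for i, f in enumerate(freqs) if rng.random() < f} for _ in range(t)]

def count_qks(dataset, n_items, k, s):
    """Q_{k,s}: number of k-itemsets with support at least s in the dataset."""
    count = 0
    for itemset in combinations(range(n_items), k):
        support = sum(1 for txn in dataset if set(itemset) <= txn)
        if support >= s:
            count += 1
    return count

def estimate_expected_qks(freqs, t, k, s, trials=100, seed=0):
    """Monte Carlo estimate of E[Q-hat_{k,s}] over independently drawn datasets."""
    rng = random.Random(seed)
    total = sum(count_qks(sample_dataset(freqs, t, rng), len(freqs), k, s)
                for _ in range(trials))
    return total / trials
```

Direct enumeration of all nCk itemsets is only feasible for toy n; smin would then be taken as the smallest s whose approximation-error bound falls below ϵ.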
A Novel Multi-Hypothesis Testing
[Flowchart: start with the support value set to smin; test successive support values, up to a maximum number of iterations needed to obtain s*; when the current value satisfies the FDR condition, s* has been found, otherwise move on to the next support value.]
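The flowchart's decision rule is not preserved in the transcript; a hypothetical sketch of the surrounding search loop, assuming a predicate fdr_ok(s) that checks the FDR condition at support s:

```python
def find_s_star(s_min, s_max, fdr_ok):
    """Scan support values upward from s_min (a bounded number of iterations);
    return the first value satisfying the FDR condition, or None."""
    for s in range(s_min, s_max + 1):
        if fdr_ok(s):
            return s          # s* found
    return None               # no admissible support within the budget
```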
Novel Testing vs. Standard Testing
 Standard FDR testing: constrains the itemsets; controls more hypotheses; evaluates the significance of each individual itemset.
 Novel testing: constrains the support threshold s*; controls fewer hypotheses; evaluates the significance of the set of itemsets as a whole.
Experimental Results – Experiments on Benchmark Datasets
Experimental Results – Experiments on Random Datasets
Experimental Results – Comparison with Standard FDR Test
[Result figures not included in the transcript]
Conclusion
 In a random dataset where items are placed independently in
transactions, there is a minimum support smin such that the
number of k-itemsets with support at least smin is well
approximated by a Poisson distribution.
 The novel multi-hypothesis test incurs a small FDR.
 First attempt at establishing a support threshold for the
classical frequent itemset mining problem with a quantitative
guarantee on the significance of the output.
Discussion
 Hard to understand, because it requires a great deal of background knowledge and notation
 Pros
 Good approximation
 Lower FDR
 Finds an appropriate support threshold by exploring the structure of the whole dataset
 Cons
 May fail to find significant itemsets with small support