PPT Version of Presentation Slides

Download Report

Transcript PPT Version of Presentation Slides

Investigation of sub-patterns discovery
and its applications
Presenter: Xun Lu
Supervisor: Jiuyong Li
Content
•
•
•
•
•
Overview
Brief Introduction
Basic Definitions
STUCCO Algorithm
MORE Algorithm
DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or
with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the
Act. DO NOT REMOVE THIS NOTICE.
Overview of this research study
•
•
•
To examine and differentiate various kinds of
contrast patterns
The scope of this research is in an attempt to
understand the principles and algorithms involved
in sub-patterns, i.e. contrast sets discovery
This thesis, ultimately, is trying to adopt the
techniques applied in STUCCO to improve the
efficiency of MORE algorithm.
3
Content
•
•
•
•
•
Overview
Brief Introduction
Basic Definitions
STUCCO Algorithm
MORE Algorithm
DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or
with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the
Act. DO NOT REMOVE THIS NOTICE.
What is Contrast data mining?
•
Contrast – “To compare or appraise in respect to
differences” (Merriam Webster Dictionary)
•
Contrast data mining – The mining of patterns
and models contrasting two or more
classes/conditions.
5
Why Contrast data mining?
“Sometimes
it's good to contrast what
you like with something else. It
makes you appreciate it even
more”
Darby Conley, Get Fuzzy, 2001
6
Content
•
•
•
•
•
Overview
Brief Introduction
Basic Definitions
STUCCO Algorithm
MORE Algorithm
DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or
with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the
Act. DO NOT REMOVE THIS NOTICE.
Some definitions for STUCCO
Contrast set: a conjunction of attribute-value pairs
defined on groups with no attribute occurring more
than once.
Support of a cset: the ratio of the record number
containing cset to the number of all records in the data
set. supp(cset) ≈ prob(cset)
Group: cset with the same prefix are placed in one
group
Upper bound: the support of an itemset consisting of
the head of the group and one item
Lower bound: the support of an itemset consisting all
the items the group
Some definitions for MORE
Contingency table Relative Risk
Risk
Disease Status
Present
Absent
Smoking
a
b
Non-Smoking
c
d
Present and Absent can be treated as class labels
(head/prefix) whereas Smoking and Non-Smoking
can be seen as the rest of elements of a contrastset. (here we only have two attributes)
RR 
pexp osed
pnonexp osed
a (a  b) ad


c (c  d ) bc
Content
•
•
•
•
•
Overview
Brief Introduction
Basic Definitions
STUCCO Algorithm
MORE Algorithm
DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or
with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the
Act. DO NOT REMOVE THIS NOTICE.
STUCCO
•
•
•
Search Testing for Understandable Consistent
Contrast
Developed by Bay and Pazzani. It aims to
efficiently mine all the contrast sets which are
significant and large, without predefined support
thresholds
It defines the support by finding the maximum
difference between upper bound and lower bound
within a group.
 max  max U [i]  L[ j ]
ij ,i  j
11
STUCCO pruning strategies
• Effective size pruning
U [i]  L[ j ] this equation ensures effect size pruning by
•  max  max
ij ,i  j
pruning the cset with the upper bound below 
• Statistical significance pruning
• Chi-square  2
• Alternative techniques: leverage/lift/relative risk/odds ratio
• Interest based pruning
• Contrast sets are not interesting when they represent no
new information
• E.g. marital_status=husband Λ sex=male
Interest based pruning cont’
STUCCO prunes the cset that do not satisfy either one
of following conditions
•
i P(cset  True | Gi )  P(cset '  True | Gi ) (1)
•
max | s (cset , Gi )  s (cset ' , Gi ) |  s
i
(2)
•  s is normally set to a very small number,say δ/2
• If A and B are itemsets where A⊂B, we also prune
the following:
• If A is infrequent, prune A and B
• A={1,4,6} B={1,3,4,6}, supp(B)must be less than supp(A).
Filtering Algorithm
DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or
with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the
Act. DO NOT REMOVE THIS NOTICE.
Content
•
•
•
•
•
Overview
Brief Introduction
Basic Definitions
STUCCO Algorithm
MORE Algorithm
DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or
with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the
Act. DO NOT REMOVE THIS NOTICE.
MORE Algorithm
• Mining Optimal Risk pattErn sets
• Input: data set, minimum support and the minimum relative
risk threshold. Output: optimal risk pattern set
DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or
with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the
Act. DO NOT REMOVE THIS NOTICE.
MORE cont’
• Advantage: it makes use of the anti-monotone
property to efficiently prune the search space
• anti-monotone: if (supp(Px|¬a))=supp(P|¬a)), then pattern
PX and all its super patterns do not occur in the optimal
risk pattern set
• Deficiencies:
• MORE requires a predefined minimum support;
• The Relative Ratio results fail to show statistical error and
residuals (details next slide);
• Needs to apply more techniques from STUCCO to
determine superfluous patterns.
DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or
with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the
Act. DO NOT REMOVE THIS NOTICE.
Statistical error and residuals
• Given this Risk pattern result generated by MORE:
RR=2.00. But this value is calculated from sample
mean, which may not represent the truth of
unobservable population mean.
• Hence, we need an acceptable range value, say
[1.84, 2.47], instead of a singe value for RR.
DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or
with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the
Act. DO NOT REMOVE THIS NOTICE.
Questions?
DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or
with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the
Act. DO NOT REMOVE THIS NOTICE.