Detecting Group Differences: Mining Contrast Sets
Download
Report
Transcript Detecting Group Differences: Mining Contrast Sets
Detecting Group Differences:
Mining Contrast Sets
Author: Stephen D. Bay
Advisor: Dr. Hsu
Graduate: Yan-Cheng Lin
Outline
Motivation
Objective
Research Review
Search for Contrast Sets
Filtering for Summarizing Contrast
Set
Evaluation
Conclusion
Motivation
Learning group differences a central
problem in many domains
Contrasting groups especially
important in social science research
Objective
Automatically detect differences
between contrasting groups from
observational multivariate data
Research Review
time series research
traditional statistical methods
rule learner and decision tree
multiple observations
miss group differences
association rule mining
multiple group and different search
criteria
Problem Definition
itemset concept extends to contrast
set
Definition 1:
Let A1,A2,...,Ak be a set of k variables called
attributes.
Each Ai can take on values from the set
{Vi1,Vi2,...Vim}.
Contrast set a conjunction of attribute –
value pairs defined on groups G1,G2,...,Gn
with no Ai occurring more than once.
Define support of contrast set
Definition 2:
The support of a contrast set with respect to a group
G is the percentage of examples in G where the
contrast set is true.
minimum support difference δ user defined threshold
Search for Contrast Sets
find contrast sets meet our criteria
though search
explore all possible contrast sets
return only sets meet our criteria
STUCCO (Search and Testing for
Understandable Consistent
Contrasts): breadth-first search
incorporates several efficiently
mining techniques
Framework
use set-enumeration trees
use breadth-first search
counting phase organize nodes into candidate
groups
Finding Significant Contrast Sets
testing the null hypothesis across all groups
support counts from contingency tables
Controlling Search Error
data mining test many hypotheses
family of tests control Type I error
Bonferroni inequality:given any set of
events e1,e2,...,en, the probability of their
union is less than or equal to the sum of
the individual probabilities
Pruning
prune when contrast sets fail to meet
effect size or statistical significance
criteria
prune when lead to uninteresting contrast
sets
Effect Size Pruning
prune nodes when bound maximum support
difference groups below δ
Statistical Significance Pruning
pruned when too few data or maximum value
X2 too small
Interest Based Pruning
contrast sets are not interesting when have
identical support or relation between groups is
fixed
Specializations with Identical Support
marital-status=husband
marital-status=husband ^ Sex = male
Fixed Relations
Fixed Relations
prune node as contrast set specializations do
not add new information
Relation to Itemset Mining
minimum support difference criterion implies
constraints support levels in individual groups
eliminate large portions of the search space based
on:
subset infrequency pruning
effect size pruning
superset frequency pruning
interest based pruning
ab
abc
Filtering for Summarizing Contrast Set
past approaches
limit the rules shown by constraint the
variables or items
compare discovered rules, show only
unexpected results
new methods
expectation based statistical approach
identify and select linear trend contrast
sets
Statistical Surprise
show most general contrast sets
first, more complicated conjunctions
if surprising based on previously
shown sets
IPF(Iterative Proportional Fitting)
find maximum likelihood estimates
Detecting Linear Trends
identical to finding change over time
detect significant contrast set by using the chisquare test
use regression techniques to find the portion of
the x2
Evaluation
three research points:
low support difference
few high support attribute-value pairs, lower bounds
can’t take advantage
pruning rules
δ -> 0 statistical significance pruning is more
important
filtering rules
Conclusion
STUCCO algorithm combined
statistical hypothesis testing with
search for mining contrast sets
STUCOO has
pruning rules efficient mining at low
support differences
guaranteed control over false positives
linear trend detection
compact summarization of result