EXPLORE-BY-EXAMPLE: AN
AUTOMATIC QUERY STEERING
FRAMEWORK FOR INTERACTIVE
DATA EXPLORATION
By Kyriaki Dimitriadou, Olga Papaemmanouil and
Yanlei Diao
AGENDA
- Introduction to AIDE
  - Data exploration
  - IDE: interactive data exploration
  - AIDE: automated interactive data exploration
  - Machine learning
- AIDE framework
- AIDE model:
  - Supervised learning: decision tree
  - Unsupervised learning: k-means
  - Measuring accuracy
  - Data classification
  - Query formulation
  - Space exploration:
    1. Relevant object discovery
    2. Misclassified exploitation
    3. Boundary exploitation
- AIDE model summary
- Conclusions
WHAT IS AIDE?
AUTOMATED INTERACTIVE
DATA EXPLORATION
EXPLORE DATA TO FIND AN APARTMENT
"But Mom, I don't want to move!"
DATA EXPLORATION
Data exploration is the first step
in data analysis and typically involves
summarizing the main characteristics of a dataset.
It is commonly conducted using visual analytics
tools, but can also be done in more advanced
statistical software, such as R.
IDE: INTERACTIVE DATA EXPLORATION
AIDE: AUTOMATED INTERACTIVE DATA EXPLORATION
An Automated Interactive Data Exploration framework that iteratively steers the user towards interesting data areas and "predicts" a query that retrieves his objects of interest.
- AIDE integrates machine learning and data management techniques to provide effective data exploration results (matching the user's interest with high accuracy) as well as high interactive performance.
WHAT IS MACHINE LEARNING?
One definition: "Machine learning is the semi-automated extraction of knowledge from data."
- Knowledge from data: starts with a question that might be answerable using data
- Automated extraction: a computer provides the insights
- Semi-automated: requires many smart decisions by a human
TWO MAIN CATEGORIES OF MACHINE LEARNING
Supervised learning: making predictions using data
- Example: is a given email "spam" or "ham"?
- There is an outcome we are trying to predict
Unsupervised learning: extracting structure from data
- Example: segment grocery store shoppers into clusters that exhibit similar behavior
- There is no "right answer"
SUPERVISED LEARNING
High-level steps of supervised learning:
1. First, train a machine learning model using labeled data
   - "Labeled data" has been labeled with the outcome
   - The "machine learning model" learns the relationship between the attributes of the data and its outcome
2. Then, make predictions on new data for which the label is unknown
SUPERVISED LEARNING
The primary goal of supervised learning is to build a model that "generalizes": it accurately predicts the future rather than the past!

Feature matrix and label vector for the spam example:

  Mail                  X1   X2   X3  |  Y
  Mail1 "Hello.."        1   29    1  |  Ham
  Mail2 "Dear..."        2   17    3  |  Spam
  Mail3 "Check out.."    3   58    1  |  Ham
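A minimal sketch of this train-then-predict loop, assuming scikit-learn is available and using the toy feature matrix above (the new mail's feature values are made up):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy feature matrix from the slide: one row per mail, columns X1..X3.
X_train = [[1, 29, 1],   # Mail1 "Hello.."
           [2, 17, 3],   # Mail2 "Dear..."
           [3, 58, 1]]   # Mail3 "Check out.."
y_train = ["ham", "spam", "ham"]  # the labeled outcomes

# 1. Train a machine learning model using labeled data.
model = DecisionTreeClassifier().fit(X_train, y_train)

# 2. Make a prediction on new data for which the label is unknown.
print(model.predict([[2, 30, 2]]))
```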
DECISION TREE CLASSIFIER
Labeled data: (x1, y1), (x2, y2), ..., (xn, yn)
Main idea:
- Form a binary tree
- Minimize the error in each leaf
[Figure: labeled points plotted in the (x1, x2) plane]
DECISION TREE CLASSIFIER
[Figure, shown over three slides: a decision tree built over the (x1, x2) plane. The root tests x1 > 1; deeper nodes test x1 > 2, x2 > 1, and x1 > 1.8. Each leaf holds class counts such as (2,3), (0,4), (3,0), (4,0), and (1,4), and the corresponding splits partition the plane at x1 = 1, x1 = 1.8, x1 = 2, and x2 = 1.]
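Read as code, such a tree is just nested axis-parallel threshold tests. A sketch using the thresholds from the figure; the exact nesting and the leaf labels are assumptions chosen for illustration:

```python
# Illustrative only: an axis-parallel decision tree as nested threshold tests.
# Thresholds come from the figure; the nesting and leaf labels are assumed.
def classify(x1: float, x2: float) -> str:
    if x1 > 1:
        if x1 > 2:
            return "relevant"        # pure leaf, e.g. counts (4, 0)
        if x1 > 1.8:
            return "relevant"        # pure leaf, e.g. counts (3, 0)
        return "irrelevant"          # majority leaf, e.g. counts (1, 4)
    if x2 > 1:
        return "irrelevant"          # pure leaf, e.g. counts (0, 4)
    return "irrelevant"              # majority leaf, e.g. counts (2, 3)
```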
HOW DOES A DECISION TREE REALLY WORK?
[Table: ten training points with attribute x1 and a binary label]
Initial error: 0.2
After split: 0.5 * 0.4 + 0.5 * 0 = 0.2
Is this a good split? Classification error alone says no: the error did not drop, even though the split produced a pure leaf.
HOW DOES A DECISION TREE REALLY WORK?
Selecting predicates – splitting criteria
A potential function val(.) guides our selection:
- Every change is an improvement. We achieve this by using a strictly concave function.
- The potential is symmetric around 0.5, namely val(q) = val(1 − q).
- Zero means perfect classification. This implies that val(0) = val(1) = 0.
- We have val(0.5) = 0.5.
- val(T) ≥ error(T), so minimizing val(T) upper-bounds the error!
HOW DOES A DECISION TREE REALLY WORK?
Splitting criteria:
- Gini index: G(q) = 2q(1 − q)
- Before the split we have G(0.8) = 2 · 0.8 · 0.2 = 0.32
- After the split we have 0.5 · G(0.6) + 0.5 · G(1) = 0.5 · 2 · 0.4 · 0.6 = 0.24
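A few lines of Python reproducing this arithmetic, showing why the strictly concave Gini criterion rewards the split that plain classification error did not:

```python
def gini(q: float) -> float:
    """Gini impurity of a node where a fraction q of its points is positive."""
    return 2 * q * (1 - q)

before = gini(0.8)                          # 2 * 0.8 * 0.2 = 0.32
after = 0.5 * gini(0.6) + 0.5 * gini(1.0)   # 0.5 * 0.48 + 0.5 * 0 = 0.24

# Classification error stayed at 0.2, but Gini drops from 0.32 to 0.24,
# so the split is accepted.
print(before, after)
```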
COMMENTS ON THE DECISION TREE METHOD
Strengths:
- Easy to use and understand
- Produces rules that are easy to interpret & implement
- Variable selection & reduction is automatic
- Does not require the assumptions of statistical models
- Can work without extensive handling of missing data
Weaknesses:
- May not perform well where there is structure in the data that is not well captured by horizontal or vertical splits
- Since the process deals with one variable at a time, it has no way to capture interactions between variables
- Trees must be pruned to avoid over-fitting the training data
UNSUPERVISED LEARNING
High-level steps of unsupervised learning (also called clustering; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing):
1. Organizing data into classes such that there is high intra-class similarity and low inter-class similarity
2. Finding the class labels and the number of classes directly from the data (in contrast to classification)
3. More informally, finding natural groupings among objects
WHAT IS A NATURAL GROUPING AMONG THESE OBJECTS?
CLUSTERING IS SUBJECTIVE
[Figure: the same set of characters grouped four different ways: Simpson's family, school employees, females, males]
WHAT IS SIMILARITY?
"The quality or state of being similar; likeness; resemblance; as, a similarity of features." – Webster's Dictionary
Similarity is hard to define, but "we know it when we see it." The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
DEFINING DISTANCE MEASURES
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2).
[Example: depending on the measure chosen, the distance between "Peter" and "Piotr" could be 0.23, 3, or 342.7]
INTUITION BEHIND DESIRABLE DISTANCE PROPERTIES
1. Symmetry: D(A,B) = D(B,A)
   Otherwise you could claim "Alex looks more like Bob than Bob looks like Alex."
2. Positivity (separation): D(A,B) = 0 iff A = B
   Otherwise there are objects in your world that are different, but you cannot tell them apart.
3. Triangle inequality: D(A,B) ≤ D(A,C) + D(B,C)
   Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
ALGORITHM K-MEANS
Goal: partition the N objects into k clusters, each object assigned to the cluster with the nearest center. (A minimal sketch follows the steps.)
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise go to 3.
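A minimal Python sketch of these five steps, assuming NumPy; the function name and interface are illustrative:

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, max_iter: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize the k centers by picking k random objects.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each object to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # step 5: no membership changed, so exit
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```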
K-MEANS CLUSTERING: STEPS 1-5
Algorithm: k-means; distance metric: Euclidean distance
[Figure sequence over five slides: points plotted by "expression in condition 1" (x-axis, 0-5) vs. "expression in condition 2" (y-axis, 0-5). Centers k1, k2, k3 are placed, each point is assigned to its nearest center, and the centers are re-estimated and move at every step until the assignments stabilize.]
COMMENTS ON THE K-MEANS METHOD
Strengths:
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally k, t << n.
- Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weaknesses:
- Applicable only when a mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
MEASURE ACCURACY
- Precision is the fraction of retrieved instances that are relevant
- Recall is the fraction of relevant instances that are retrieved
MEASURE ACCURACY: F-SCORE
- The F-score can be interpreted as a weighted average of precision and recall
- The F-score reaches its best value at 1 and its worst at 0

F-score = 2 * (precision * recall) / (precision + recall)
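In Python, computed directly from retrieved and relevant sets (a toy sketch; a real implementation should also guard against empty sets):

```python
def f_score(retrieved: set, relevant: set) -> float:
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved)  # retrieved that are relevant
    recall = true_positives / len(relevant)      # relevant that were retrieved
    return 2 * precision * recall / (precision + recall)

# Retrieving {1,2,3,4} when {2,3,5} is relevant:
print(f_score({1, 2, 3, 4}, {2, 3, 5}))  # precision 0.5, recall ~0.67, F ~0.57
```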
QUESTIONS ABOUT MACHINE LEARNING
- How do I choose which attributes of my data to include in the model?
- How do I choose which model to use?
- How do I optimize this model for best performance?
- How do I ensure that I'm building a model that will generalize to unseen data?
- Can I estimate how well my model is likely to perform on unseen data?
BACK TO AIDE…
HOW DOES AIDE WORK?
- A framework that automatically "steers" the user towards data areas relevant to his interest
- In AIDE, the user engages in a "conversation" with the system, indicating his interests, while in the background the system automatically formulates and processes queries that collect data matching the user's interest
AIDE FRAMEWORK
[Figure: the iterative loop. The user labels data samples; a decision tree classifier is trained on them; promising sampling areas are identified; the next sample set is retrieved from the DB; and the loop repeats.]
AIDE CHALLENGES
- AIDE operates on the unlabeled data space that the user aims to explore
- To achieve a desirable interactive experience, AIDE needs not only to provide accurate results, but also to minimize the number of samples presented to the user (which determines the amount of user effort)
- There is a trade-off between quality of results (accuracy) and efficiency (the total exploration time, which includes the total sample-reviewing time and the wait time of the user)
ASSUMPTIONS
- Prediction of linear patterns: user interests are captured by range queries
- Binary, non-noisy relevance system, where the user indicates whether a data object is relevant or not, and this categorization cannot be modified in following iterations
- Categorical and numerical features
DATA CLASSIFICATION
- Decision tree classifier to identify linear patterns of user interest
- Decision tree advantages:
  - Easy to interpret
  - Performs well with large data
  - Easy mapping to queries that retrieve the relevant data objects
  - Can handle both numerical and categorical data
QUERY FORMULATION
Let us assume a decision tree classifier that predicts relevant and irrelevant clinical-trials objects based on the attributes age and dosage. The relevant leaves map to a query such as:

SELECT *
FROM table
WHERE (age <= 20 AND dosage > 10 AND dosage <= 15)
   OR (age > 20 AND age <= 40 AND dosage >= 0 AND dosage <= 10)
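A sketch of how such a query could be assembled from the tree's relevant leaves; the leaf representation (a dict of per-attribute bounds, with None for an open side) is an assumption for illustration, not AIDE's actual data structure:

```python
# Hypothetical relevant leaves from the clinical-trials example above.
relevant_leaves = [
    {"age": (None, 20), "dosage": (10, 15)},   # age <= 20, 10 < dosage <= 15
    {"age": (20, 40),   "dosage": (None, 10)}, # 20 < age <= 40, dosage <= 10
]

def leaf_predicate(leaf: dict) -> str:
    """Turn one leaf's per-attribute (low, high) bounds into a conjunction."""
    parts = []
    for attr, (low, high) in leaf.items():
        if low is not None:
            parts.append(f"{attr} > {low}")
        if high is not None:
            parts.append(f"{attr} <= {high}")
    return "(" + " AND ".join(parts) + ")"

# One disjunct per relevant leaf.
query = "SELECT * FROM table WHERE " + " OR ".join(map(leaf_predicate, relevant_leaves))
print(query)
```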
SPACE EXPLORATION OVERVIEW
- The focus is on optimizing the effectiveness of the exploration while minimizing the number of samples presented to the user
- The goal is to discover relevant areas and formulate user queries that select either a single relevant area (conjunctive queries) or multiple ones (disjunctive queries)
- Three exploration phases:
  1. Relevant object discovery
  2. Misclassified exploitation
  3. Boundary exploitation
PHASE ONE: RELEVANT OBJECT DISCOVERY
- Focuses on collecting samples from yet-unexplored areas and identifying single relevant objects
- This phase aims to discover relevant objects by showing the user samples from diverse data areas
- To maximize the coverage of the exploration space, it follows a well-structured approach that allows AIDE to:
  1. ensure that the exploration space is explored widely
  2. keep track of the already explored sub-areas
  3. explore different data areas at different granularities
PHASE ONE: RELEVANT OBJECT DISCOVERY
[Figure: the (Attribute A, Attribute B) space divided into exploration grids at levels 1, 2, and 3, with progressively finer cells]
- d – number of features
- β – granularity
- β^d – grid cells
PHASE ONE: RELEVANT OBJECT DISCOVERY
Algorithm:
- For each grid cell, retrieve a single random object within distance γ < δ/2 along each dimension from the cell's center (a sketch follows)
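A sketch of the per-cell sampling boxes under these definitions; the function name and the (low, high) domain representation are assumptions:

```python
import itertools

def discovery_cells(domain, beta, gamma):
    """domain: list of (low, high) per dimension; beta: cells per dimension.
    Yields a sampling box of half-width gamma around each grid-cell center."""
    widths = [(hi - lo) / beta for lo, hi in domain]  # delta per dimension
    for cell in itertools.product(range(beta), repeat=len(domain)):
        center = [lo + (i + 0.5) * w
                  for (lo, _), i, w in zip(domain, cell, widths)]
        yield [(c - gamma, c + gamma) for c in center]  # requires gamma < delta/2

# e.g. explore a 2-D space at granularity beta = 3:
for box in discovery_cells([(0, 120), (0, 120)], beta=3, gamma=5):
    pass  # issue: SELECT * FROM table WHERE each attr BETWEEN lo AND hi, one object
```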
PHASE ONE: RELEVANT OBJECT DISCOVERY
Optimizations:
- Hint-based object discovery: specific attribute ranges on which the user desires to focus
- Skewed attributes: use the k-means algorithm to partition the data space into k clusters; database objects are assigned to the cluster with the closest centroid
PHASE TWO: MISCLASSIFIED EXPLOITATION
- The goal is to discover relevant areas, as opposed to single objects
- This phase strategically increases the relevant objects in the training set such that the predicted queries will select relevant areas
- Designed to increase both the precision and the recall of the final query
- Strives to limit the number of extraction queries, and hence the time overhead of this phase
PHASE TWO: MISCLASSIFIED EXPLOITATION
Generation of misclassified samples:
- Assume the decision tree classifier Ci is generated in the i-th iteration
- This phase leverages the misclassified samples to identify the next set of sampling areas, in order to discover more relevant areas
- It addresses the lack of relevant samples by collecting more objects around false negatives
PHASE TWO: MISCLASSIFIED EXPLOITATION
"Naïve" algorithm:
- Collect samples around each false negative to obtain more relevant samples
- Very successful in identifying relevant areas
- High time cost:
  - executes one retrieval query per misclassified object
  - often redundantly samples highly overlapping areas, spending resources (i.e., user labeling effort) without increasing AIDE's accuracy much
  - if k iterations are needed to identify a relevant area, the user might have labeled f × k samples without improving the F-measure
PHASE TWO: MISCLASSIFIED EXPLOITATION
Clustering-based exploitation algorithm (sketched below):
- Create clusters using the k-means algorithm, with one sampling area per cluster
- Sample around each cluster center
- In each iteration i, the algorithm sets k to the overall number of relevant objects discovered in the object discovery phase
- We run the clustering-based exploitation only if k is less than the number of false negatives
- Experimental results showed that f should be set to a small number (10-25 samples)
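A minimal sketch of this idea, assuming scikit-learn; the interfaces are illustrative, not AIDE's actual ones:

```python
import numpy as np
from sklearn.cluster import KMeans

def exploitation_centers(false_negatives: np.ndarray, k: int) -> np.ndarray:
    """Cluster the false negatives and return one sampling center per cluster."""
    # Run clustering-based exploitation only if k < number of false negatives;
    # otherwise sampling around each false negative is already cheap enough.
    if k >= len(false_negatives):
        return false_negatives
    return KMeans(n_clusters=k, n_init=10).fit(false_negatives).cluster_centers_

# Each returned center becomes one sampling area: a single retrieval query
# collecting f (10-25, per the slide) objects within a small radius of it.
```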
PHASE THREE: BOUNDARY EXPLOITATION
- Given a set of relevant areas identified by the decision tree classifier, this phase aims to refine these areas by incrementally adjusting their boundaries
- This yields a better characterization of the user's interests, i.e., higher accuracy of our final results
- This phase has the smallest impact on the effectiveness of our model: not discovering a relevant area can reduce our accuracy more than a partially discovered relevant area with imprecise boundaries. Hence, we constrain the number of samples used during this phase to α_max
- Aims to distribute an equal amount of user effort to refining each boundary
PHASE THREE: BOUNDARY EXPLOITATION
Algorithm (sketched below):
- Input:
  - α_max – number of samples
  - k – number of d-dimensional relevant areas
  - 2d – number of boundaries per area
- Collect α_max / (k × 2d) random samples within a distance ±x from each boundary
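A sketch of this budget split in Python; the area representation (a list of per-dimension (low, high) bounds) and function name are assumptions:

```python
import random

def boundary_samples(areas, alpha_max, x):
    """areas: k relevant areas, each a list of d (low, high) boundary pairs."""
    k, d = len(areas), len(areas[0])
    per_boundary = alpha_max // (k * 2 * d)  # equal user effort per boundary
    for area in areas:
        for dim, (low, high) in enumerate(area):
            for bound in (low, high):
                for _ in range(per_boundary):
                    # Random point inside the area, then pushed to within
                    # +/- x of the boundary along the dimension being refined.
                    point = [random.uniform(lo, hi) for lo, hi in area]
                    point[dim] = random.uniform(bound - x, bound + x)
                    yield point
```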
PHASE THREE: BOUNDARY EXPLOITATION
Optimizations:
1. Adaptive sample size: dynamically adapts the number of samples collected.
- d – dimensionality of the exploration space
- pc_{i−1}^j – percentage of change of boundary j between the (i − 1)-th and i-th iterations
- er – error variable, to cover cases where the boundary is not modified but also not accurately predicted
- pc_{i−1} is calculated as the difference of the boundary's normalized values in the specific dimension
PHASE THREE: BOUNDARY EXPLOITATION
Optimizations:
2. Non-overlapping sampling areas: when the exploration areas do not evolve significantly between iterations, re-sampling them is redundant and increases exploration cost (e.g., user effort) without improving classification accuracy
PHASE THREE: BOUNDARY EXPLOITATION
Optimizations:
3. Identifying irrelevant attributes: domain sampling around the boundaries. While shrinking/expanding one dimension of a relevant area, collect random samples over the whole domain of the remaining dimensions
PHASE THREE: BOUNDARY EXPLOITATION
Optimizations:
4. Exploration on sampled datasets:
- Generate a random sampled database and extract our samples from the smaller sampled dataset
- This optimization can be used for both the misclassified and the boundary exploitation phases
- Sampled datasets are generated using a simple random-sampling approach that picks each tuple with the same probability
AIDE MODEL SUMMARY
- Initial sample acquisition (domain experts could restrict the attribute set on which the exploration is performed)
- The iterative steering process starts when the user provides his feedback:
  - Data classification
  - Space exploration:
    - Relevant object discovery
    - Misclassified exploitation
    - Boundary exploitation
  - Data extraction query:
    - Sample extraction
    - Query formulation
CONCLUSIONS
- AIDE assists users in discovering new interesting data patterns and eliminates expensive ad-hoc exploratory queries
- AIDE relies on a seamless integration of classification algorithms and data management optimization techniques that collectively strive to accurately learn the user's interests based on his relevance feedback on strategically collected samples
- Our techniques minimize the number of samples presented to the user (which determines the amount of user effort) as well as the cost of sample acquisition (which amounts to the user wait time)
- AIDE provides interactive performance, as it limits the user wait time per exploration iteration to less than a few seconds
ANY QUESTIONS?
AND NOW FOR REAL..
https://www.youtube.com/watch?v=1BwIw_t_J_4