Transcript Talk Slides

Prediction Cubes
Bee-Chung Chen, Lei Chen,
Yi Lin and Raghu Ramakrishnan
University of Wisconsin - Madison
Subset Mining

We want to find interesting subsets of the dataset
- Interestingness: defined by the “model” built on a subset
- Cube space: a combination of dimension-attribute values defines a candidate subset (just like regular OLAP)

We want the measures to represent decision/prediction behavior
- Summarize a subset using the “model” built on it
- Big change from regular OLAP!
2
The Idea

Build OLAP data cubes in which cell values represent decision/prediction behavior
- In effect, build a tree for each cell/region in the cube; note that this is not the same as a collection of trees used in an ensemble method!
- The idea is simple, but it leads to promising data-mining tools

Ultimate objective: exploratory analysis of the entire space of “data mining choices”
- Choice of algorithms, data-conditioning parameters, …
3
Example (1/7): Regular OLAP
Goal: Look for patterns of unusually high numbers of applications.

Fact table (Z: dimensions, Y: measure):

  Location   Time      # of App.
  AL, USA    Dec, 04   2
  …          …         …
  WY, USA    Dec, 04   3

Dimension hierarchies:
- Location: All → Country (Japan, USA, Norway, …) → State (AL, …, WY)
- Time: All → Year (85, 86, …, 04) → Month (Jan. 86, …, Dec. 86)
4
Example (2/7): Regular OLAP
Goal: Look for patterns of unusually high numbers of applications.

Cell value: number of loan applications.

[Figure: the fact table (Location, Time, # of App.) aggregated at the coarser [Country, Year] level (e.g., CA and USA for 2003, 2004, …), with roll-up from / drill-down to the finer [State or Province, Month] level (e.g., AB, …, YT under CA and AL, …, WY under USA, for Jan–Dec of each year).]
5
Example (3/7): Decision Analysis
Goal: Analyze a bank’s loan decision process w.r.t. two dimensions: Location and Time.

Fact table D (Z: dimensions, X: predictors, Y: class):

  Location   Time      Race    Sex   …   Approval
  AL, USA    Dec, 04   White   M     …   Yes
  …          …         …       …     …   …
  WY, USA    Dec, 04   Black   F     …   No

A cube subset Z(D) of D is used to build a model h(X; Z(D)), e.g., a decision tree.

Dimension hierarchies:
- Location: All → Country (Japan, USA, Norway, …) → State (AL, …, WY)
- Time: All → Year (85, 86, …, 04) → Month (Jan. 86, …, Dec. 86)
6
Example (3/7): Decision Analysis

Are there branches (and time windows) where approvals were closely tied to sensitive attributes (e.g., race)?
- Suppose you partitioned the training data by location and time, chose the partition for a given branch and time window, and built a classifier. You could then ask, “Are the predictions of this classifier closely correlated with race?”

Are there branches and times with decision making reminiscent of 1950s Alabama?
- Requires comparison of classifiers trained using different subsets of the data.
7
Example (4/7): Prediction Cubes
[Figure: a prediction cube at the [Country, Month] level, with one value per cell, e.g., CA: 0.4, 0.8, 0.9, 0.8, … and USA: 0.2, 0.3, 0.5, … across Jan–Dec of 2004, 2003, …]

For each cell, e.g., [USA, Dec 04]:
1. Build a model using the data from the USA in Dec. 2004, i.e., the cube subset [USA, Dec 04](D):

  Location   Time      Race    Sex   …   Approval
  AL, USA    Dec, 04   White   M     …   Y
  …          …         …       …     …   …
  WY, USA    Dec, 04   Black   F     …   N

   giving the model h(X; [USA, Dec 04](D)), e.g., a decision tree.
2. Evaluate that model; the result becomes the cell value.

Measure in a cell:
- Accuracy of the model
- Predictiveness of Race measured based on that model
- Similarity between that model and a given model
8
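The per-cell construction above can be phrased as a loop over cube subsets. The following is a minimal Python sketch that fills one cuboid with accuracy as the cell value; the pandas/scikit-learn stack, the column names, and the choice of a decision tree are illustrative assumptions, not part of the original slides.

# Naive prediction-cube construction: for each [Country, Month] cell, train a
# classifier on that cell's subset of the fact table and store a per-cell measure
# (here, accuracy on a labeled test set) as the cell value.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def prediction_cube_accuracy(fact_table, test_X, test_y,
                             dims=("Country", "Month"),
                             predictors=("Race", "Sex"), label="Approval"):
    cube = {}
    for cell_key, subset in fact_table.groupby(list(dims)):
        if subset[label].nunique() < 2:          # skip degenerate single-class cells
            continue
        model = DecisionTreeClassifier().fit(
            pd.get_dummies(subset[list(predictors)]), subset[label])
        X = pd.get_dummies(test_X[list(predictors)]).reindex(
            columns=model.feature_names_in_, fill_value=0)
        cube[cell_key] = (model.predict(X) == test_y).mean()    # cell value
    return cube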
Example (5/7): Model-Similarity
Given:
- Data table D
- Target model h0(X)
- Test set Δ w/o labels

Data table D:

  Location   Time      Race    Sex   …   Approval
  AL, USA    Dec, 04   White   M     …   Yes
  …          …         …       …     …   …
  WY, USA    Dec, 04   Black   F     …   No

For each cell at level [Country, Month], build a model on that cell’s subset of D, apply both that model and the target model h0(X) to the unlabeled test set Δ, and use the similarity of their predictions as the cell value.

[Figure: the resulting cube of similarity values at level [Country, Month] (rows CA, USA, …), and the test set Δ with columns Race, Sex, … alongside the two models’ predicted labels.]

Interpretation: a high value in cell [USA, Dec 04] says the loan decision process in the USA during Dec 04 was similar to the discriminatory decision model h0(X).
9
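One natural instantiation of the similarity measure is the fraction of test examples on which the cell’s model and h0 agree. The sketch below assumes both models are plain Python callables returning a class label; the slides do not fix this interface, and other similarity functions fit the same framework.

# Model-similarity cell value: agreement rate of the cell's model with the target
# model h0 on an unlabeled test set (a list of feature dictionaries).
def model_similarity(cell_model, h0, test_set):
    agree = sum(1 for x in test_set if cell_model(x) == h0(x))
    return agree / len(test_set)

# Toy usage: a hand-written "discriminatory" target model vs. a trivial cell model.
h0 = lambda x: "Yes" if x["Race"] == "White" else "No"
cell_model = lambda x: "Yes"
print(model_similarity(cell_model, h0,
                       [{"Race": "White"}, {"Race": "Black"}, {"Race": "White"}]))  # ~0.67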
Example (6/7): Predictiveness
Given:
- Data table D
- A set of attributes V (here V = {Race})
- Test set Δ w/o labels

Data table D:

  Location   Time      Race    Sex   …   Approval
  AL, USA    Dec, 04   White   M     …   Yes
  …          …         …       …     …   …
  WY, USA    Dec, 04   Black   F     …   No

For each cell at level [Country, Month], build two models on that cell’s subset: h(X), using all predictors, and h(X − V), using all predictors except V. Apply both to the unlabeled test set Δ; the difference between their predictions is the predictiveness of V, which becomes the cell value.

[Figure: the resulting cube of predictiveness values at level [Country, Month] (rows CA, USA, …), and the test set Δ with columns Race, Sex, … alongside the predictions of h(X) and h(X − V).]

Interpretation: a high value in cell [USA, Dec 04] says that Race was an important predictor of the loan approval decision in the USA during Dec 04.
10
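A simple way to read “difference between their predictions” is as the disagreement rate of h(X) and h(X − V) on the test set; that reading, like the caller-supplied fit function below, is an illustrative assumption rather than the only choice the framework allows.

# Predictiveness of V in one cell: train with and without the attributes in V, then
# measure how often the two models disagree on the unlabeled test set.
def predictiveness(train_rows, predictors, V, label, test_set, fit):
    """fit(rows, features, label) must return a callable model: feature dict -> class."""
    h_full = fit(train_rows, predictors, label)
    h_without_v = fit(train_rows, [a for a in predictors if a not in V], label)
    disagree = sum(1 for x in test_set if h_full(x) != h_without_v(x))
    return disagree / len(test_set)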
Example (7/7): Prediction Cube
Cell value: predictiveness of Race.

[Figure: the prediction cube at the [Country, Month] level (e.g., CA: 0.4, 0.1, 0.3, 0.6, 0.8, …; USA: 0.7, 0.4, 0.3, 0.3, …), rolled up to the [Country, Year] level (e.g., CA: 0.3, 0.2 and USA: 0.2, 0.3 for 2004, 2003) and drilled down to the [State/Province, Month] level (AB, …, YT under CA; AL, …, WY under USA), where the WY row stands out with high values such as 0.9, 0.7, 0.8.]
11
Efficient Computation

Reduce prediction cube computation to data cube computation
- Represent a data-mining model as a distributive or algebraic (bottom-up computable) aggregate function, so that data-cube techniques can be applied directly
12
Bottom-Up Data Cube Computation

Cell values: numbers of loan applications.

Base cuboid [Country, Year] and its roll-ups, computed bottom-up:

            1985   1986   1987   1988   All
  Norway      10     30     20     24    84
  …           23     45     14     32   114
  USA         14     32     42     11    99
  All         47    107     76     67   297
13
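A brief Python sketch of the bottom-up step illustrated above: the coarser cells ([Country], [Year], [All]) are computed by summing already-computed base cells rather than rescanning the raw records. The “…” country row is omitted here for brevity.

# Roll up a base [Country, Year] cuboid of counts to its coarser cuboids by summation.
base = {("Norway", 1985): 10, ("Norway", 1986): 30, ("Norway", 1987): 20, ("Norway", 1988): 24,
        ("USA", 1985): 14, ("USA", 1986): 32, ("USA", 1987): 42, ("USA", 1988): 11}

by_country, by_year = {}, {}
for (country, year), n in base.items():
    by_country[country] = by_country.get(country, 0) + n
    by_year[year] = by_year.get(year, 0) + n
total = sum(base.values())

print(by_country)  # {'Norway': 84, 'USA': 99}
print(by_year)     # {1985: 24, 1986: 62, 1987: 62, 1988: 35}
print(total)       # 183 (84 + 99; the slide's 297 also includes the omitted "…" row)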
Functions on Sets



Bottom-up computable functions: functions that can be computed using only summary information.

Distributive function: α(X) = F({α(X1), …, α(Xn)})
- where X = X1 ∪ … ∪ Xn and Xi ∩ Xj = ∅
- E.g., Count(X) = Sum({Count(X1), …, Count(Xn)})

Algebraic function: α(X) = F({G(X1), …, G(Xn)})
- G(Xi) returns a fixed-length vector of values
- E.g., Avg(X) = F({G(X1), …, G(Xn)}), where G(Xi) = [Sum(Xi), Count(Xi)] and F({[s1, c1], …, [sn, cn]}) = Sum({si}) / Sum({ci})
14
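A tiny Python sketch of the two notions above, assuming a set X split into disjoint parts X1, X2, X3; the particular numbers are made up.

# Count is distributive: the whole-set value is computed from the parts' own values.
# Avg is algebraic: each part contributes a fixed-length summary G(Xi) = [sum, count],
# from which F recovers the whole-set average.
parts = [[3, 5], [2], [4, 6, 10]]                    # X1, X2, X3 with X = X1 ∪ X2 ∪ X3

# Distributive: Count(X) = Sum({Count(Xi)})
count_X = sum(len(p) for p in parts)                 # 6

# Algebraic: Avg(X) = F({G(Xi)}) with G(Xi) = [Sum(Xi), Count(Xi)]
G = [(sum(p), len(p)) for p in parts]                # [(8, 2), (2, 1), (20, 3)]
avg_X = sum(s for s, _ in G) / sum(c for _, c in G)  # 5.0

assert count_X == len([x for p in parts for x in p])
assert avg_X == sum(x for p in parts for x in p) / count_X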
Scoring Function


Represent a model as a function of sets.

Conceptually, a machine-learning model h(X; Z(D)) is a scoring function Score(y, x; Z(D)) that gives each class y a score on test example x:
- h(x; Z(D)) = argmax_y Score(y, x; Z(D))
- Score(y, x; Z(D)) ≈ p(y | x, Z(D))
- Z(D): the set of training examples (a cube subset of D)
15
Bottom-up Score Computation

Key observations:
- Observation 1: Score(y, x; Z(D)) is a function of the cube subset Z(D); if it is distributive or algebraic, the bottom-up data-cube technique can be applied directly.
- Observation 2: Having the scores for all the test examples and all the cells is sufficient to compute a prediction cube.
  - Scores → predictions → cell values
  - The details depend on what each cell means (i.e., the type of prediction cube), but are straightforward.
16
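A minimal sketch of Observation 2: once the per-class scores of every test example are known for a cell, the cell value follows mechanically. Taking the cell value to be the fraction of test examples predicted “Yes” is an illustrative choice, and the class names and score values below are made up.

# Scores -> predictions -> cell value for one cell.
def cell_value_from_scores(scores):
    """scores: one dict {class -> score} per test example."""
    predictions = [max(s, key=s.get) for s in scores]                    # scores -> predictions
    return sum(1 for y in predictions if y == "Yes") / len(predictions)  # -> cell value

print(cell_value_from_scores([{"Yes": 0.8, "No": 0.2},
                              {"Yes": 0.3, "No": 0.7},
                              {"Yes": 0.6, "No": 0.4}]))  # 0.666...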
Machine-Learning Models

- Naïve Bayes: scoring function is algebraic
- Kernel-density-based classifier: scoring function is distributive
- Decision tree, random forest: scoring function is neither distributive nor algebraic
- PBE: probability-based ensemble (new)
  - Makes any machine-learning model distributive
  - An approximation
17
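The sketch below illustrates why the Naïve Bayes scoring function is algebraic: each base cell is summarized by a fixed-length vector of counts, the summaries of sub-cells merge by addition, and the merged counts are all that is needed to score a test example. The add-one smoothing and the assumption of binary attribute domains are illustrative simplifications.

# Per-cell summary G(Xi), merge F, and a Naive Bayes score computed from merged counts.
from collections import Counter

def nb_summary(rows, attrs, label):
    """G(Xi): class counts and (attribute, value, class) counts for one base cell."""
    class_cnt = Counter(r[label] for r in rows)
    av_cnt = Counter((a, r[a], r[label]) for r in rows for a in attrs)
    return class_cnt, av_cnt

def nb_merge(summaries):
    """F: the summary of a coarser cell is the element-wise sum of its sub-cells' summaries."""
    class_cnt, av_cnt = Counter(), Counter()
    for c, av in summaries:
        class_cnt += c
        av_cnt += av
    return class_cnt, av_cnt

def nb_score(y, x, summary, attrs, smooth=1.0):
    """Unnormalized Score(y, x) = p(y) * prod_a p(x_a | y), estimated from merged counts."""
    class_cnt, av_cnt = summary
    score = class_cnt[y] / sum(class_cnt.values())
    for a in attrs:
        # "2" assumes binary attribute domains; use the true domain size in general.
        score *= (av_cnt[(a, x[a], y)] + smooth) / (class_cnt[y] + 2 * smooth)
    return score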
Probability-Based Ensemble
[Figure: a decision tree built directly on the cell [WA, 85] vs. the PBE version of the decision tree on [WA, 85], which combines the decision trees built on the lowest-level cells [WA, Jan 85], …, [WA, Dec 85].]
18
Probability-Based Ensemble

Scoring function:
- hPBE(x; S(D)) = argmax_y ScorePBE(y, x; S(D))
- ScorePBE(y, x; bi(D)) = h(y | x; bi(D)) · g(bi | x)
- ScorePBE(y, x; S(D)) = Σ_{i ∈ S} ScorePBE(y, x; bi(D))

where
- h(y | x; bi(D)): model h’s estimate of p(y | x, bi(D))
- g(bi | x): a model that predicts the probability that x belongs to base subset bi(D)
19
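A minimal Python sketch of the PBE scoring function above. The base-cell model h and the subset-membership model g are passed in as plain callables; that interface is an illustrative assumption.

# PBE score for class y on example x, summed over the base subsets b_i contained in cell S.
def pbe_score(y, x, base_cells, h, g):
    """h(y, x, b) ~ p(y | x, b(D));  g(b, x) ~ p(x belongs to base subset b)."""
    return sum(h(y, x, b) * g(b, x) for b in base_cells)

def pbe_predict(x, classes, base_cells, h, g):
    return max(classes, key=lambda y: pbe_score(y, x, base_cells, h, g))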
Outline





- Motivating example
- Definition of prediction cubes
- Efficient prediction cube materialization
- Experimental results
- Conclusion
20
Experiments

- Quality of PBE on 8 UCI datasets
  - The quality of the PBE version of a model is slightly worse (0 ~ 6%) than the quality of the model trained directly on the whole training data.
  - [Figure: the PBE ensemble built from the lowest-level cells of [WA, 1985] compared against a single model trained on all of [WA, 1985].]
- Efficiency of the bottom-up score computation technique
- Case study on demographic data
21
Efficiency of Bottom-up Score Computation

Machine-learning models:
- J48: J48 decision tree
- RF: random forest
- NB: Naïve Bayes
- KDC: kernel-density-based classifier

Bottom-up method vs. exhaustive method:
- PBE-J48 vs. J48ex
- PBE-RF vs. RFex
- NB vs. NBex
- KDC vs. KDCex
22
Synthetic Dataset

Dimensions: Z1, Z2 and Z3.

[Figure: dimension hierarchies. Z1 and Z2: All → {A, B, C, D, E} → {0, 1, …, 9}; Z3: All → {0, 1} → …]

Decision rule:

  Condition                 Rule
  When Z1 > 1               Y = I(4X1 + 3X2 + 2X3 + X4 + 0.4X6 > 7)
  else when Z3 mod 2 = 0    Y = I(2X1 + 2X2 + 3X3 + 3X4 + 0.4X6 > 7)
  else                      Y = I(0.1X5 + X1 > 1)
23
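A sketch of a generator for this synthetic dataset, following the decision rule above. The slides do not specify the predictor distributions, so X1 through X6 are drawn as Bernoulli(0.5) and the dimension values uniformly over 0 to 9; those choices are assumptions made only to give a runnable example.

# Generate synthetic records whose label Y follows the rule table, with the active
# rule selected by the dimension values Z1 and Z3.
import random

def I(cond):
    return 1 if cond else 0

def make_record(rng):
    z1, z2, z3 = rng.randint(0, 9), rng.randint(0, 9), rng.randint(0, 9)
    x1, x2, x3, x4, x5, x6 = (rng.randint(0, 1) for _ in range(6))  # assumed Bernoulli(0.5)
    if z1 > 1:
        y = I(4*x1 + 3*x2 + 2*x3 + x4 + 0.4*x6 > 7)
    elif z3 % 2 == 0:
        y = I(2*x1 + 2*x2 + 3*x3 + 3*x4 + 0.4*x6 > 7)
    else:
        y = I(0.1*x5 + x1 > 1)
    return {"Z1": z1, "Z2": z2, "Z3": z3, "X1": x1, "X2": x2, "X3": x3,
            "X4": x4, "X5": x5, "X6": x6, "Y": y}

rng = random.Random(0)
data = [make_record(rng) for _ in range(1000)]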
Efficiency Comparison
[Figure: execution time (sec) vs. number of records (40K to 200K; y-axis up to 2500 sec). The curves for the exhaustive method (RFex, KDCex, NBex, J48ex) grow much faster than those for bottom-up score computation (NB, KDC, RFPBE, J48PBE).]
24
Related Work: Building models on OLAP Results

- Multi-dimensional regression [Chen, VLDB 02]
  - Goal: detect changes of trends
  - Build linear regression models for cube cells
- Step-by-step regression in stream cubes [Liu, PAKDD 03]
- Loglinear-based quasi cubes [Barbara, J. IIS 01]
  - Use loglinear models to approximately compress dense regions of a data cube
- NetCube [Margaritis, VLDB 01]
  - Build a Bayes net on the entire dataset to approximately answer count queries
25
Related Work (Contd.)

- Cubegrades [Imielinski, J. DMKD 02]
  - Extend cubes with ideas from association rules
  - How does the measure change when we roll up or drill down?
- Constrained gradients [Dong, VLDB 01]
  - Find pairs of similar cell characteristics associated with big changes in measure
- User-cognizant multidimensional analysis [Sarawagi, VLDBJ 01]
  - Help users find the most informative unvisited regions in a data cube using the maximum-entropy principle
- Multi-Structural DBs [Fagin et al., PODS 05, VLDB 05]
26
Take-Home Messages

A promising exploratory data analysis paradigm:
- Can use models to identify interesting subsets
- Concentrate only on subsets in cube space
  - Those are meaningful subsets, and tractable
  - Precompute results and provide the users with an interactive tool

A simple way to plug “something” into cube-style analysis:
- Try to describe/approximate the “something” by a distributive or algebraic function
27
Big Picture



- Why stop with decision behavior? Can apply to other kinds of analyses too.
- Why stop at browsing? Can mine prediction cubes in their own right.
- Exploratory analysis of the mining space:
  - Dimension attributes can be parameters related to the algorithm, data conditioning, etc.
  - Tractable evaluation is a challenge:
    - Large number of “dimensions”, real-valued dimension attributes, difficulties in compositional evaluation
    - Active learning for experiment design, extending compositional methods
28
Community Information Management (CIM)
Anhai Doan
University of Illinois at Urbana-Champaign
Raghu Ramakrishnan
University of Wisconsin-Madison
Structured Web-Queries

Example queries:
- How many alumni are top-10 faculty members?
  - Wisconsin does very well, by the way
- Find trends in publications
  - By topic, by conference, by alumni of schools
- Change tracking
  - Alert me if my co-authors publish new papers or move to new jobs

Information is extracted from text sources on the web, then queried.
30
Key Ideas

Communities are ideally scoped chunks of the web for which to build enhanced portals
- Relative uniformity in content, interests
- Can exploit “people power” via mass collaboration, to augment extraction

CIM platform: facilitate the collaborative creation and maintenance of community portals
- Extraction management
  - Uncertainty, provenance, maintenance, compositional inference for refining extracted information
  - Mass collaboration for extraction and integration

Watch for new DBWorld!
31
Challenges

User interaction:
- Declarative specification of background knowledge and user feedback
- Intelligent prompting for user input
- Explanation of results
32
Challenges

Extraction and query plans:
- Starting from user input (ER schema, hints) and background knowledge (e.g., standard types, look-up tables), compile a query into an execution plan
- Must cover extraction, storage and indexing, and relational processing
  - And maintenance!
- An algebra to represent such plans? A query optimizer?
  - Handling uncertainty, constraints, conflicts, multiple related sources, ranking, modular architecture
33
Challenges

Managing extracted data:
- Mapping between extracted metadata and source data
- Uncertainty of the mapping
- Conflicts (in user input, background knowledge, or from multiple sources)
- Evolution over time
34