OLAP Over Uncertain and Imprecise Data

Download Report

Transcript OLAP Over Uncertain and Imprecise Data

OLAP Over Uncertain and
Imprecise Data
T.S. Jayram (IBM Almaden)
with Doug Burdick (Wisconsin), Prasad Deshpande
(IBM), Raghu Ramakrishnan (Wisconsin), Shivakumar
Vaithyanathan (IBM)
Dimensions in OLAP
Automobile
All
Truck
Sedan
Civic
Camry
F150
Sierra
Location
All
East
West
CA
TX
NY
MA
Measures, Facts, and Queries
Automobile
Truck
Sedan
F150
MA
Camry
p2
NY
Civic
p1
Sierra
TX
CA
West
ALL
Location
East
Auto = F150
Loc = NY
Repair = $200
ALL
Auto = Truck
Loc = East
SUM(Repair) = ?
p7
p6
p8
p4
p5
p3
Cell
Extend the OLAP model to handle data
ambiguity
Imprecision
Uncertainty
Imprecision
Auto = F150
Loc = East
Repair = $200
Automobile ALL
Truck
Sedan
Civic
Camry
F150
Sierra
MA
p9
p11
p1
TX
CA
West
ALL
Location
NY
East
p2
p7
p6
p8
p10
p5
p3
p4
Representing Imprecision using Dimension
Hierarchies
Dimension hierarchies lead to a natural
space of “partially specified” objects
Sources of imprecision: incomplete data,
multiple sources of data
Motivating Example
Query: COUNT
Truck
F150
p4
MA
p3
Sierra
NY
East
We propose desiderata
that enable
p5
appropriate definition of query
semantics for imprecise data
p1
p2
Desideratum I: Consistency
Truck
F150
Sierra
p4
p5
NY
East
MA
p3
p1
p2
Consistency
specifies the
relationship
between answers
to related queries
on a fixed data
set
Desideratum II: Faithfulness
p3
MA
p5
F150
p4
p1
p2
Sierra
p5
p3
Data Set 3
F150
p4
p1
p2
Sierra
p5
MA
Sierra
NY
NY
MA
F150
Data Set 2
p4
p3
NY
Data Set 1
p1
Faithfulness specifies the relationship between
answers to a fixed query on related data sets
p2
Formal definitions of both Consistency and
Faithfulness depend on the underlying
aggregation operator
Can we define query semantics that
satisfy these desiderata?
F150
Query
Semantics
p1
p2
F150
Possible
Worlds
MA
F150
p5
[Kripke63,…]
p3
p1
MA
NY
w2
p2
p3
Sierra
p5 p4
w4
w3
Sierra
p4
p2
F150
MA
p1
w1
p4
NY
p3
p5
NY
p4
p3
NY
MA
p5
Sierra
NY
MA
F150
Sierra
Sierra
p5 p4
p3
p1
p2
p1
p2
Possible Worlds Query Semantics
Given all possible worlds together with
their probabilities, queries are easily
answered (using expected values)
But number of possible worlds is
exponential!
Allocation
Allocation gives facts weighted assignments to
possible completions, leading to an extended
version of the data
Size increase is linear in number of (completions of)
imprecise facts
Queries operate over this extended version
Key contributions:
Appropriate characterization of the large space of
allocation policies
Designing efficient allocation policies that take into
account the correlations in the data
Storing Allocations using Extended Data
Model
Truck
F150
Sierra
NY
East
MA
p5
p4
p3
p1
p2
ID
FactID
Auto
Loc
Repair
Weight
1
1
F150
NY
100
1.0
2
2
Sierra
NY
500
1.0
3
3
F150
MA
150
0.6
4
3
F150
NY
150
0.4
5
4
Sierra
MA
200
1.0
6
5
F150
MA
100
0.5
7
5
Sierra
MA
100
0.5
Classifying Allocation Policies
Measure Correlation
Ignored
Used
Dimension
Correlation
Ignored
Used
Uniform
Count
EM
Results on Query Semantics
Evaluating queries over extended version of
data yields expected value of the aggregation
operator over all possible worlds
intuitively, the correct value to compute
Efficient query evaluation algorithms for SUM,
COUNT
consistency and faithfulness for SUM, COUNT are
satisfied under appropriate conditions
Dynamic programming algorithm for AVERAGE
Unfortunately, consistency does not hold for
AVERAGE
Alternative Semantics for AVERAGE
APPROXIMATE AVERAGE
E[SUM] / E[COUNT] instead of
E[SUM/COUNT]
simpler and more efficient
satisfies consistency
extends to aggregation operators for
uncertain measures
Uncertainty
Measure value is modeled as a probability
distribution function over some base domain
e.g., measure Brake is a pdf over values {Yes,No}
sources of uncertainty: measures extracted from text
using classifiers
Adapt well-known concepts from statistics to
derive appropriate aggregation operators
Our framework and solutions for dealing with
imprecision also extend to uncertain measures
Summary
Consistency and faithfulness
desiderata for designing query semantics for
imprecise data
Allocation is the key to our framework
Efficient algorithms for aggregation
operators with appropriate guarantees of
consistency and faithfulness
Iterative algorithms for allocation policies
Correlation-based Allocation
Involves defining an objective function to capture
some underlying correlation structure
a more stringent requirement on the allocations
solving the resulting optimization problem yields the
allocations
EM-based iterative allocation policy
interesting highlight: allocations are re-scaled
iteratively by computing appropriate aggregations