mcdb

Transcript mcdb

Managing Uncertain Data
MCDB:
A MONTE CARLO APPROACH
J A M P A N I (1), X U (1), W U
(1)
(1), P E R E Z (1), J E R M A I N E (1), H A S S (2)
UNIVERSITY OF FLORIDA,
(2)
IBM RESEARCH CENTER
DANIEL BLUM
SDBI
H E B R E W U N I V E R S I T Y, C O M P U T E R S C I E N C E D E P T.
2008
Outline
 Introduction – Uncertainty in Database Systems
 Motivation (Problems in Probabilistic Databases)
 Monte Carlo Method
 MCDB
 Introduction
 Schema Specification
 VG Functions
 Inference and Accuracy
 Queries, Tuple Bundles and Operations
 Experiments and Conclusions
Outline
 Introduction – Uncertainty in Database Systems
 Motivation (Problems in Probabilistic Databases)
 Monte Carlo Method
 MCDB
 Introduction
 Schema Specification
 VG Functions
 Inference and Accuracy
 Queries, Tuple Bundles and Operations
 Experiments and Conclusions
Introduction – Uncertainty in Database System
 Types of Uncertainty
 Uncertainty – Can’t determine whether an assertion in the
model is true or false. “The age of John is 35”
 Imprecision – The information is not as specific as it should
be. “The age of John is between 30 and 40”
 Vagueness – Elements in the model are inherently vague.
“John is in early middle age”
 Inconsistency – The model contains two or more assertions
that can’t be true at the same time. “John is 35” and “The age
of John is between 25 and 30”
 Ambiguity – Some elements of the model lack complete
semantics. “John’s salary is 15,000$” (Month? Year?)
Introduction – Uncertainty in Database System
 Categories of Uncertainty
 Description Uncertainty – The actual data which is stored
in the database is uncertain.
 Transaction Uncertainty – Uncertain queries which might
return uncertain data or could update the database with
uncertain data.
 Processing Uncertainty – The algorithms used for
processing might:
Sacrifice accuracy for the sake of simplicity/complexity
 Use statistical model and approximations
 Fail in giving accurate results because a lack of resources during
processing

Introduction – Uncertainty in Database System
 Description Uncertainty

What is affected?

Uncertain data values
The time of a given flight in the database is uncertain.

Tuple uncertainty
There is no actual flight from A to B.

Relation\Structural uncertainty
There is no direct flight from A to B (There are limitations in the
representation of the data in the relation).
Introduction – Uncertainty in Database System
 Description Uncertainty

Source for uncertainty?

Unavailable information
John has a well defined age which isn’t available.

Unreliable information sources
Faulty readings, forms which were filled incorrectly, noise etc.

Information gathering methods
Usage of estimations or approximations.

Model limitations
Database stores only two occupations per employee.
Introduction – Uncertainty in Database System
 Description Uncertainty:

Degrees of uncertainty

Existence
There is uncertainty if the tuple even exists.

Disjunctive – Set of alternatives
An attribute that has a set of alternatives.

Probabilistic
We have a range and probability for each value.
Introduction – Uncertainty in Database System
 Dealing with uncertainty



Null values
First Name
Last Name
Grade
Daniel
Blum
Null
Joint probabilities (ERM – Extended Relational Model)
First Name
Last Name
Grade
Probability
Daniel
Blum
99
0.01
Adding “lineage” in addition to stored probabilities
ID
Name
Course
ID
Course
Grade
ID
Name
Grade
11
Daniel
SDBI
21
SDBI
99
31
Daniel
99
Lineage: (31) = {11,21}
Outline
 Introduction – Uncertainty in Database Systems
 Motivation (Problems in Probabilistic Databases)
 Monte Carlo Method
 MCDB
 Introduction
 Schema Specification
 VG Functions
 Inference and Accuracy
 Queries, Tuple Bundles and Operations
 Experiments and Conclusions
Motivation (Problems in Probabilistic Databases)
 Problem 1 – “Hard Wired” Data
ID
Name
Grade
Probability
31
Daniel
99
0.01
ID
Name
Grade
Probability
31
Daniel
99
0.01
Lineage:
(31) = {11,21}
If a new unanticipated manifestation of
uncertainty is found?
Motivation (Problems in Probabilistic Databases)
 Problem 2 – Flexibility and Expressiveness



ID
Name
Grade
Probability
31
Daniel
99
0.01
Lineage:
(31) = {11,21}
How can we update the probability?
What if the we use the a complex statistical model?
Can we dynamically parameterize the uncertainty on the global
state of the database?
What about on the results from queries?
Motivation (Problems in Probabilistic Databases)
 Problem 3 – “Extrapolation Uncertainty”
Product
Price
Q1 Profits
Q2 Profits
Q3 Profits
Q4 Profits
iPod nano
149$
56M$
44M$
20M$
66M$
iPod touch
229$
72M$
78M$
40M$
90M$
MacBook
999$
32M$
40M$
38M$
55M$
Can we use the current state of the database in order
to answer the following question
What would our profits have been in the last 4
quarters if we have raised the prices by 5%?
Motivation (Problems in Probabilistic Databases)
 Problem 3 – “Extrapolation Uncertainty”

How can we solve this problem?

Bayesian approach – Uses a “prior” distribution model for the
customers demands, and create a “posterior” distribution demand for
the new prices

Why is it difficult to implement this analysis in ERM?

This statistical model is quite unique

The parameterization of the model depends on the current state

The posterior distribution for the new demand is complex
We might not be able to represent it in a closed-form
Outline
 Introduction – Uncertainty in Database Systems
 Motivation (Problems in Probabilistic Databases)
 Monte Carlo Method
 MCDB
 Introduction
 Schema Specification
 VG Functions
 Inference and Accuracy
 Queries, Tuple Bundles and Operations
 Experiments and Conclusions
Monte Carlo Method
 From Wikipedia
“Monte Carlo methods are a class of computational
algorithms that rely on repeated random sampling
to compute their results”.
 Useful in studying systems with large number of
coupled degrees of freedom.
Example Modeling phenomena with significant
uncertainty in inputs, such as the calculation of risk
in business.
Monte Carlo Method
 Example – Calculating Pi (π)

Draw a square and inscribe a circle within

Uniformly scatter objects of uniform size (rice)

Count the number of objects in the circle, multiply by 4 and
divide by the total number of objects
in the square

The proportion will approximate π/4
Monte Carlo Method
 The General Pattern

Define a domain of possible inputs

Generate inputs randomly from the domain

Perform a deterministic computation using the inputs

Aggregate the results of the individual computations into the
final result
Outline
 Introduction – Uncertainty in Database Systems
 Motivation (Problems in Probabilistic Databases)
 Monte Carlo Method
 MCDB
 Introduction
 Schema Specification
 VG Functions
 Inference and Accuracy
 Queries, Tuple Bundles and Operations
 Experiments and Conclusions
MCDB - Introduction
 MCDB – The Monte Carlo Database System
Allows the user to attach arbitrary stochastic models to the
database data in order to “guess” the values for unknown or
inaccurate data.
The models produce multiple possible database instances
(“possible worlds”) and the query is run over each instance
(but in an efficient way…)
MCDB - Introduction
 The Monte Carlo Database System

Doesn’t encode uncertainty within the data model itself.
Instead allows to define arbitrary variable generation (VG)
functions
(This will encode the logics of the Bayesian approach)

Uses parameter tables rather than probabilities, which are easy
to change

The VG functions generate a large number of i.i.d “possible
worlds”, and the query is run over each of them
MCDB - Introduction
 Monte Carlo Benefits

Can deal with several types of queries which are problematic in exact
computation:


Complex query (“EXISTS”, “NOT IN”…) causes significant problems (P#)

Also queries such as “SUM” and ‘AVG” pose significant challenge

It is often unclear how to analyze query output (e.g. quantiles).
Same general purpose methods apply to correlated or uncorrelated
model

Can easily deal with arbitrary continuous distributions, even if there
isn’t a closed-from representation
MCDB - Introduction
 Limitations of MCDB


Exact evaluation of some query results is not feasible

Tuple appearance probabilities

Expected value of an aggregation query
This is caused by the generality of the VG (“black box”) which
makes it hard to quantify the relationship of it to the query
result
MCDB - Introduction
 Two natural concerns to MCDB

Performance


Implementation of an efficient system
Estimation of output results

MCDB merely estimates the output results
“How do we generate or calculate the probabilities in the
probabilistic database in the first place?”
MCDB - Introduction
 Contributions of MCDB

First “pure” Monte Carlo approach toward managing uncertain
data

Powerful and flexible representation of data uncertainty

Provide a syntax which requires small changes to SQL
VG are similar to the User Defined Functions (UDF)

Provide new query processing algorithms that execute a query
plan only once

Acceptable performance according to experiments
Outline
 Introduction – Uncertainty in Database Systems
 Motivation (Problems in Probabilistic Databases)
 Monte Carlo Method
 MCDB
 Introduction
 Schema Specification
 VG Functions
 Inference and Accuracy
 Queries, Tuple Bundles and Operations
 Experiments and Conclusions
MCDB – Schema Specification
 Definitions

A relation is “deterministic” if its realization is the same in all
possible worlds, otherwise it is “random”

Each “random” relation is specified by a schema and a set of
VG functions

The output of a query is a probability distribution over possible
answers
MCDB – Schema Specification
 Create-Table and Correlations

An extended version of SQL CREATE TABLE and VG functions
generate relation instances

Each random relation is viewed as a union of blocks of
correlated tuples

Tuples in different blocks are independent
MCDB – Schema Specification
 Syntax – Example (1) “Patient systolic blood pressure”
 Exact values are unavailable, but values are distributed
normally
 Default blood pressure is 100 mm Hg
 We know the mean and standard deviation
 Patients are assumed independent
SBP_PARAM
PATIENTS
Mean
Std
PID
Gender
10
5
1
Female
2
Male
3
Female
MCDB – Schema Specification
 Syntax – Example (2) “Patient systolic blood pressure”
CREATE TABLE SBP_DATA(PID, GENDER, SBP) AS
FOR EACH p in PATIENTS
WITH SBP AS Normal (
One parameter
(SELECT s.MEAN, s.STD
VG function
FROM SBP_PARAM s))
SELECT p.PID, p.GENDER, b.VALUE
FROM SBP b
MCDB – Schema Specification
 Parameterizing VG Functions (1) “Customers expenses”

We know the region in which a customer lives
CID

Region
CUST_ATTRS
For region we associate a probability that a customer lives there
Name

Gender
Region
Prob
CITIES
Money attribute follows a gamma distribution

Shift – Same for all

Scale – Regional

Shape – Customer level
Shift
MONEY_SHIFT
Region
Scale
MONEY_SCALE
CID
Shape
MONEY_SHAPE
MCDB – Schema Specification
 Parameterizing VG Functions (2) “Customers expenses”
CREATE TABLE CUST (CID, GENDER, MONEY, LIVES_IN) AS
FOR EACH d in CUST_ATTRS
WITH MONEY AS Gamma (
(SELECT n.SHAPE
FROM MONEY_SHAPE n
WHERE n.CID = d.CID),
(SELECT sc.SCALE
FROM MONEY_SCALE sc
WHERE sc.REGION = d.REGION),
(SELECT SHIFT
FROM MONEY_SHIFT))
WITH LIVES_IN AS DiscreteChoice (
(SELECT c.NAME, c.PROB
FROM CITIES c
WHERE c.REGION= d.REGION))
SELECT d.CID, d.GENDER, m.VALUE, l.VALUE
FROM MONEY m, LIVES_IN l
Three parameter
VG function
One parameter
VG function
MCDB – Schema Specification
 Capturing ERM Functionality (attribute-value)

Lets look on a variant of the previous example in which each
customer is associated with a set of possible cities
Name

Region
Prob
ID
Name
Prob
A small change in the definition of the 2nd VG and we capture
the ability to represent attribute-value uncertainty
WITH LIVES_IN AS DiscreteChoice (
(SELECT c.NAME, c.PROB
FROM CITIES c
WHERE c.REGION= d.REGION))
WITH LIVES_IN AS DiscreteChoice (
(SELECT c.NAME, c.PROB
FROM CITIES c
WHERE c.CID= d.CID))
MCDB – Schema Specification
 Capturing ERM Functionality (tuple-inclusion)

We can add to the same example from previous slides tupleinclusion uncertainty in the following manner
CID

Gender
Region
CID
Gender
Region INCL_PROB
Add the VG
WITH IN_TABLE AS Bernoulli (VALUES(d.INCL_PROB))

And by adding the following line to the SELECT clause
SELECT d.CID, d.GENDER, m.VALUE, l.VALUE
FROM MONEY m, LIVES_IN l
WHERE i.VALUE = true
MCDB – Schema Specification
 Structural Uncertainty – Fuzzy Queries (1)

Customers locations
LID

City
LOCATIONS
Customers transaction records
SID

Name
Name
Amount
SALES
We define a random table which simulate a fuzzy join
regarding the Name field
MCDB – Schema Specification
 Structural Uncertainty – Fuzzy Queries (2)
CREATE TABLE LS_JOIN (LID, SID) AS
FOR EACH t in (
SELECT l.LID, l.NAME AS NAME1,
s.SIN, s.NAME AS NAME2
FROM LOCATIONS l, SALES s)
WITH JOINS AS Bernoulli (
VALUES(Sim(t.NAME1, t,NAME2)))
SELECT t.LID, t.SID
FROM JOINS j
WHERE j.VALUE = true
Sim returns a value
between 0 and 1
according string
similarity
Bernoulli returns
“true” with
probability p as given
and “false” with (1-p)
MCDB – Schema Specification
 Structural Uncertainty – Fuzzy Queries (3)

Overall result is given by the query
SELECT l.CITY, SUM(s.AMOUNT)
FROM LOCATIONS l, SALES s, LS_JOIN j
WHERE l.TID = j.lid AND s.SID = j.SID
GROUP BY l.CITY
MCDB – Schema Specification
 Correlated Attributes
 If income and city of residence are correlated we can handle
this in the following manner
CREATE TABLE CUST (CID, GENDER, MONEY, LIVES_IN) AS
FOR EACH d in CUST_ATTRS
WITH MLI AS MyJointDistribution (…)
SELECT d.CID, d.GENDER, MLI.VALUE1, MLI.VALUE2
FROM MLI
MCDB – Schema Specification
 Correlated Tuples (1) “Temperature Sensors”
 Each sensor reading as mean of normal probability
distribution
 Sensors are divided into groups according physical closeness
 Stored data
ID
LAD
LONG
ID
MEAN
ID1
ID2
GID
S_PARAMS
MEANS
COVAR
COVARS
MCDB – Schema Specification
 Correlated Tuples (2) “Temperature Sensors”
CREATE TABLE SENSORS(ID, LAD, LONG, TEM c.ID1 P) AS
FOR EACH g in
(SELECT DISTINCT GID FROM S_PARAMS)
MDNoraml is
WITH TEMP AS MDNormal (
invoked once per
(SELECT m.ID, m.MEAN
group, and
FROM MEANS m, S_PARAMS ss
WHERE m.ID=ss.ID AND ss.GID=g.GID), returns multi-row
table having one
(SELECT c.ID1, c.ID2, c.COVER
row per group
FROM COVERS c, S_PARAMS ss
member
WHERE c.ID1=ss.ID AND ss.GID=g.GID),
SELECT s.ID, s.LAD, s.LONG, t.VALUE
FROM S_PARAMS s, TEMP t
WHERE s.ID=t.ID
Outline
 Introduction – Uncertainty in Database Systems
 Motivation (Problems in Probabilistic Databases)
 Monte Carlo Method
 MCDB
 Introduction
 Schema Specification
 VG Functions
 Inference and Accuracy
 Queries, Tuple Bundles and Operations
 Experiments and Conclusions
MCDB – VG Functions
 Basic Interface

Implemented in C++

Each VG function has the following four public methods

Initialize() – Seed for the VG function

TakeParams() – Gets the parameters, one vector at a time

OutputVals() – One row returned per call

Finalize() – Deletes the internal VG function data structures
Outline
 Introduction – Uncertainty in Database Systems
 Motivation (Problems in Probabilistic Databases)
 Monte Carlo Method
 MCDB
 Introduction
 Schema Specification
 VG Functions
 Inference and Accuracy
 Queries, Tuple Bundles and Operations
 Experiments and Conclusions
MCDB – Inference and Accuracy
 Inference Operator – Definition


In general MCDB returns its query results as a set of (ti, fi)
 ti
– Are the distinct tuples produced in the N MC iterations
 fi
– The fraction of the N possible worlds which tuple ti appears
SELECT SUM(sales) FROM T – Where T is a random table
doesn’t return a fixed number, but a random number
MCDB – Inference and Accuracy
 Accuracy

We can use statistical tools (central limit theorem) in order to
calculate the number of Monte Carlo iterations needed to
estimate E[x] to within a desired precision

We can also perform statistical tests of hypotheses such as
“The expected value of the result Q1 is greater than the
expected value of the result of Q2”
Outline
 Introduction – Uncertainty in Database Systems
 Motivation (Problems in Probabilistic Databases)
 Monte Carlo Method
 MCDB
 Introduction
 Schema Specification
 VG Functions
 Inference and Accuracy
 Queries, Tuple Bundles and Operations
 Experiments and Conclusions
MCDB – Queries, Tuples Bundles and Operations
 Query – Processing (Naïve)
 Generate an instance of each random relation as specified by
the various CREATE TABLE statements
 Once an entire instance of the database has been materialized
execute the query Q
 Append every tuple in Q’s answer set with a number
identifying the current Monte Carlo iteration
 Once N different answer sets have been generated, compute
the for each tuple the number of worlds it belongs to
Naïve implementation is impractical – MCDB has other strategy!
MCDB – Queries, Tuples Bundles and Operations
 Query – Processing in MCDB (1)
 MCDB runs each query one time, regardless of N


MCDB delays random attribute materialization


Each tuple in MCDB is actually a “bundle” of tuples where t[i] is
the value of t in the i-th Monte Carlo database instance
They will be materialized right before random attributes are used
by some relational operation
In MCDB values of attributes are reproducible
MCDB ensures that each tuple carries the pseudorandom number
seeds that is supplies to the VG functions
 The seed is a compressed representation of the random attribute
values in the tuple bundle

MCDB – Queries, Tuples Bundles and Operations
 Tuple Bundle – Definition

The only requirement on a set of tuple bundles t1,t2, … tk is
that, for each i the set r 
i-th realization of R

i
t j [i ] corresponds precisely to the

j
A tuple bundle in MCDB may have a special random attribute
called “isPresent”. And t[i].isPres=true iff there is a tuple in the
bundle which actually appears in the i-th MC instance. If it is
not explicitly represented it assumed to be true for all i.
This is implemented as a binary array
MCDB – Queries, Tuples Bundles and Operations
 New Operations – Introduction

Because we use the tuple-bundle notation we need to add some
additional operations in order to handle different situations

The classical relational operations must be modified slightly to
handle the fact that tuple bundles move through the query plan
MCDB – Queries, Tuples Bundles and Operations
 New Operations – Seed Operator

For a given random table R and VG function V, the SEED
operator appends each tuple created by R’s FOR EACH
statement an integer unique to the (tuple, VG) pair

This serves the pseudorandom seed for V when expanding the
tuple into uncompressed tuple bundle
MCDB – Queries, Tuples Bundles and Operations
 New Operations – Instantiate Operator (1)

The most unique and fundamental operator used by MCDB

For a random table R, this operator uses a VG function to
generate a set of attribute values corresponding to a MC
iteration – which is appended to individual tuple bundles in R.

Lets follow an example…
MCDB – Queries, Tuples Bundles and Operations
 New Operations – Instantiate Operator (2)
 The operator gets the following 7 parameters from the CREATE
TABLE statement
 Qout – The answer set for the FOR EACH
 VG – The VG function
 VG Atts – Set of attributes produced by the VG (Normal.VALUE)
 Out Atts – Set of attributes in the result (p.ID and p.GENDER)
 Qin1, Qin2,… - The answer sets for the inner queries for the VG
(SELECT s.MEAN…s.GENDER FROM SBP_PARAM)
 In Atts1, In Atts2,… - Set of attributes from the i-th inner query
(s.MEAN and s.STD)
 B1, B2, … - The boolean join (s.GENDER = p.GENDER)
MCDB – Queries, Tuples Bundles and Operations
 New Operations – Instantiate Operator (3)
MCDB – Queries, Tuples Bundles and Operations
 New Operations – Split Operator (1)

In case that one of the attributes in the tuple-bundle is not
constant (equal on all instances) we need to have a split
operator for being able to use other operators

The SPLIT takes a tuple-bundle and returns a new set of tuple
bundles in which the given attributes are constant
MCDB – Queries, Tuples Bundles and Operations
 New Operations – Split Operator (2)
Fname

Lname
age
Lets say we have 4 MC instances in which Fname and Lname
are constant, only the age isn’t

In i=1,3 age=20 and in i=2,4 age=21.


T = (Jane, Smith, (20,21,20,21), (T,T,T,T))
After the SPLIT on where atts={age}

T1 = (Jane, Smith, 20, (T,F,T,F))

T2 = (Jane, Smith, 20, (F,T,F,T))
MCDB – Queries, Tuples Bundles and Operations
 New Operations – Inference Operator

This operator returns a set of distinct unbundled tuples, and
for each the fraction of the MC iterations for which it appears

This is done in the following way

Call SPLIT.

Call Duplicate Removal (in a few slides)

Counts the number of i values for which t[i].isPres=true.
MCDB – Queries, Tuples Bundles and Operations
 Standard Relational Operations – Selection

Given a boolean relational selection predicate B and a tuple
bundle t:

For each i

t[i].isPres = B(t[i]) and t[i].isPres
 Standard Relational Operations – Projection

If a non-constant attribute is projected away, the entire array
of values for that attribute is removed.
MCDB – Queries, Tuples Bundles and Operations
 Standard Relational Operations – Join


Cartesian Product

t[i] = r[i] s[i] (Tuple concatenation)

t[i].isPres = r[i].isPres and s[i].isPres
Join

Cartesian Product

Applying the modified Selection operator
MCDB – Queries, Tuples Bundles and Operations
 Standard Relational Operations – Duplicate Removal

Call SPLIT

Sort lexicographically according attribute values

For each group (T) one result tuple is output

t[i].isPres = V(t’ in T) t’[i].isPres
Outline
 Introduction – Uncertainty in Database Systems
 Motivation (Problems in Probabilistic Databases)
 Monte Carlo Method
 MCDB
 Introduction
 Schema Specification
 VG Functions
 Inference and Accuracy
 Queries, Tuple Bundles and Operations
 Experiments and Conclusions
MCDB – Experiments and Conclusions
 Experiment – Setup

20,000 lines of C++ code

20GB instance of TCP-H database

3,000$ server with 4 160GB ATA HDs and 8 2.0GHz cores
partitioned over 2 CPUs.
8 GB RAM

Ubuntu Linux OS
MCDB – Experiments and Conclusions
 Experiment – Query 1
 This query guesses the revenue gain for products
supplied by Japanese companies next year (assumed to
be 1996), assuming that current sales trends hold. The ratio of
sales volume in 1995 to 1994 is first computed on a percustomer basis.
 Then the 1996 sales are generated by replicating each 1995
order a random number of times, according to a Poisson
distribution with mean .
 This process approximates a “bootstrapping” re-sampling
scheme.
 Once 1996 is generated, the additional revenue is computed.
MCDB – Experiments and Conclusions
 Experiment – Query 2
 This query estimates the number of days until all
orders that were placed today are delivered.
 Using past data, the query computes the mean and variance of
both time-to-shipment and time-to-delivery for each part.
 For each order placed today, instances of these two random
delays are generated according to discretized gamma
distributions with the computed means and variances.
 Once all of the times are computed for each component of each
order, the maximum duration is selected.
MCDB – Experiments and Conclusions
 Experiment – Query 3
 One shortcoming of the TPC-H schema is that, for a given
supplier and part, only the current price is maintained
in the database.
 Thus, it is difficult to ask, “What would the total amount
paid to suppliers in 1995 have been if we had always
gone with the most inexpensive supplier?” Query Q3
starts with the current price for each item from each supplier
and then performs a random walk to guess prices from
December, 1995 back to January, 1995.
MCDB – Experiments and Conclusions
 Experiment – Query 4

This query is the one mentioned at the beginning of the
lecture, which estimates the effect of a 5% customer price
increase on an organization’s profits.
MCDB – Experiments and Conclusions
 Experiment – Results
MCDB - Experiments and Conclusions
 Contributions of MCDB

First “pure” Monte Carlo approach toward managing uncertain
data

Powerful and flexible representation of data uncertainty

Provide a syntax which requires small changes to SQL
VG are similar to the User Defined Functions (UDF)

Provide new query processing algorithms that execute a query
plan only once

Acceptable performance according to experiments
MCDB – Experiments and Conclusions
 Future Work

Query optimization (improving the optimizer)

Error control (currently user selects the number of iterations)

Improved risk assessment (estimating quantiles)

Correlated relations (finding the best way to handle)

Lineage (explicitly)

Non-relational applications (Uncertain XML)
Thank You
“Gambling: The sure way of getting
nothing from something”
~Wilson Mizner
Resources
 The Paper:
 MCDB: a Monte Carlo approach to managing uncertain data
 Management of Uncertainty in Database Systems

http://cs.gmu.edu/~ami/research/publications/pdf/modern94.pdf
 Other Resources:
 http://en.wikipedia.org/wiki/Monte_Carlo_method
 http://www.tpc.org/tpch/default.asp
 http://portal.acm.org/citation.cfm?doid=1376616.1376686
 http://www.tpc.org/tpch/default.asp
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.7
320

mcdb

Transcript mcdb

Directory