Transcript ppt slides

Robust Estimation With Sampling
and Approximate Pre-Aggregation
Author: Christopher Jermaine
Presented by: Bill Eberle
Overview






Issues and Questions
Approximate Pre-Aggregation (APA)
Maximum Likelihood Estimation
Performing the Estimation
Experiments
Observations
Issues

Assumes numerical or discretized attributes



Many data warehouse (for DSS) attributes are
categorical (ex. country = France)
Can not handle too many dimensions
Stratified sampling (used to handle variance
in values over which aggregation is
performed)


Assumes workload is known
Targeting particular queries means other queries
suffer
Questions




Can we decrease the effect of random
sampling variance on categorical (or mixed)
data?
Can we still require only one pass through the
data?
Can we increase the accuracy of the
estimation for all aggregate queries?
Can we do this without prior knowledge of
the query workload?
Approximate Pre-Aggregation


Uses a true random sampling that is not
biased towards any query or workload.
Sample is combined with small set of
statistics about the data that is
gathered at same time that sampling is
performed.
Approximate Pre-Aggregation
(example)

Using a sample of 50% of the data (unshaded) to answer the query:
select SUM(cmplaints)
from sample
where prof = ‘Smith’
we get 36, and since the sample constitutes 50% of the database tuples, we
estimate that 72 students complained about Professor Smith.



The actual number is 121 – a relative error of 40.5%!
The issue is in the variance of students who complained each semester.
APA uses some additional information to decrease this error.
Approximate Pre-Aggregation
(cont.)


Suppose we also know that the total number of complaints in the entire table is
148, and the total number of database tuples is 16.
Use the selection predicate (i.e. the “where” clause), and divide the data space
into 2n quadrants.



For each of the quadrants, estimate the probability density function (pdf).
Use additional information to create constraints on the distributions.


In this example, we know that the number of complaints is 148 and the number of
tuples is 16, which gives us a mean of 148/16 = 9.25
Then need to resolve any errors in the mean.


In this example, n=1 (prof=‘Smith’), so there will be two quadrants: prof=‘Smith’ and
prof<>’Smith’.
In this example, the mean of the “prof=‘Smith’” quadrant is 4.5 and the mean of the
“prof<>’Smith’” quadrant is 1.25, for a total mean of 5.75 – significantly different from
9.25.
Use the Maximum Likelihood Estimation (MLE) to re-estimate the mean.

In this example, the mean of complaints for Professor Smith now becomes 8.52
(instead of 4.5), which gives us an estimated total of 136.3 (8.52 * 16)… much closer
to the actual 121 than 72.
Maximum Likelihood
Estimation

Let x be an observable outcome from a given experiment.


In our example, x might be the fact that our random sample
predicts that the value of our relational aggregate SUM query is
72.
Let the pdf with respect to the observing outcome x be the
following function, where the parameters are the hidden
parameters that we wish to estimate (they describe the
model we want to discover):
f ( x;1 , 2 ,..., k )
Maximum Likelihood
Estimation (cont.)

In order to “fit” the hidden model
parameters to the observation (or
observations) we maximize the
likelihood that our particular model
produced the data (loglikelihood):
n
   log f ( xi ;1 ,  2 ,...,  k )
i 1
Maximum Likelihood
Estimation (in APA)



Basic idea of APA is simply to find the best (or most likely)
explanation for the sample which does not violate any of the
known facts about the database.
Need to pose problem of approximate aggregation over
categorical data as a MLE problem.
Three specific components of maximum likelihood that we need
to describe in the context of APA:




The experimental outcomes x1,x2,…,xn
The model parameters
The pdf
Once we these three components have been defined, we have
transformed the problem of estimation of aggregate functions
over categorical data into a MLE problem, and we can begin the
task of developing a method to solve the problem.
Maximum Likelihood
Estimation (outcomes)


First, we describe how we obtain the
“outcomes” x1,x2,…,xn to postulate APA as a
MLE problem. In APA, those outcomes are a
set of predictions made by our sample.
Suppose we have the following aggregate
query:
select SUM(salary)
from Employee
where sex=‘M’ and dept=‘accounting’ and
job_type=‘supervisor’

We can number each of the clauses in the
relational selection predicate from 1 to m. In
this case:



b1=(sex=‘male’), b2=(dept=‘accounting’) and
b3=(job_type=‘supervisor’)
Also the negation of each of the clauses
Conceptually, the result is a data cube (which
we will call 2m). Each of the boolean
conditions corresponds to a single cell in the
multidimensional data cube.
Maximum Likelihood
Estimation (outcomes, cont.)


The outcomes x1,x2,…,x2m for the MLE in APA are the results of
the aggregate function in question with respect to each of the
relational selection predicates in 2m, estimated using our
sample.
Examples:



We know that our sample-based estimate for SUM(salary) over b1
^ b2 ^ b3 is $1.5M. b1 ^ b2 ^ b3 is the first entry in 2m. This,
$1.5M is used as the value of x1.
We know that our sample-based estimate for SUM(salary) over b1
^ ~b2 ^ ~b3 is $1.1M. This is the fourth entry in 2m, thus x4 is
$1.1M.
In this way, the random sample is used to estimate the value for
each cell in the cube, and these values become the outcomes
x1,x2,…,x2m.
Maximum Likelihood
Estimation (parameters)




Second, we need to describe the set of model parameters which we
will attempt to estimate.
In APA, these parameters are defined to be the APA guess as to the
real value of the aggregate function in question, with respect to each
of the cells in the multidimensional data cube.
If xi is the value of cell i predicted by the sample, then the model
parameter 0i is the APA maximum likelihood estimate for the correct
value for the aggregate function applied to cell i.
Example:

Assume we know the fact:
(SUM(salary) where job_type!=‘supervisor’) = $2.3M


The relational selection predicate in this fact is (b1^b2^~b3) v
(b1^~b2^~b3) v (~b1^b2^~b3) v (~b1^~b2^~b3). This is a disjunction of
the third, fourth, seventh and eight predicates present in 2m.
Thus, this fact is equivalent to the constraint that 03+04+07+08 = $2.3M.
Maximum Likelihood
Estimation (pdf)


Finally, we need to define the probability density function f, which will give us
the likelihood that we would see the experimental observations x1,x2,…,x2m,
given model parameters 01,02,…,02m.
Example: Assume we attempt to estimate the value for an aggregate of the
form:
select AGG(expression)
from TABLE
where (predicate)
Assume we have a sample size of n from a database of size db, and the estimated
value of the query based on the sample is z… (Hellerstein and Haas)

To make a long story short, the pdf is defined as follows:
f ( x; ) 
 n 
1

exp 

2
2
T
(
v
)


Performing the Estimation



Many ways to attempt to estimate the solution to a maximum
likelihood problem.
Usually necessary to approximate or estimate because of the
inherent intractability of discovering the most likely model in the
general case.
Best known is the Expectation Maximization algorithm (EM).



Begins with an initial guess at the solution and then repeatedly
refines the guess until it reaches a locally optimal solution
EM simply seeks to maximize the loglikelihood value, while APA
needs to maximize within the constraints of the model
parameters.
Instead, APA uses quadratic programming formulation.
Performing the Estimation
(quadratic programming)




Quadratic programming is an extension of linear
programming with the generalization that the
objective function to maximize may contain products
of two variables, and not simply scalars.
A key advantage of this is that many algorithms have
been developed that efficiently solve problems posed
as quadratic equations.
The ability of quadratic programming to incorporate
constraints makes it ideal for APA.
The constraints for this quadratic programming
formulation are simply the linear sums of the values
01,02,…,02B.
Experiments

Six different approximate options, over eight real, high-dimensional
data sets, were performed:








Random sampling
Stratified sampling
APA0: store and use all “0-dimensional” facts as constraints in the quadratic
programming.
APA1: same as APA0, as well as “1-dimensional” facts.
APA2: same as APA1, as well as “2-dimensional” facts.
APA3: same as APA2, as well as “3-dimensional” facts.
Wavelets
Results from AVG aggregation:
Observations






Wavelets are unsuitable in this domain.
Random sampling better for COUNT queries.
Additional accuracy of APA2 and APA3 probably not worth the
overhead; also are impractical for numerical data as that would
require joint distributions of numerical and categorical attributes
(difficult).
APA0 and APA1 can easily be extended to handle numerical
attributes.
Should be possible to easily extend APA to work across foreign
key joins, using a technique for sampling from the results of
joins (ex. join synopses).
Largely sidestepped issues associated with computational
efficiency.